
1 Introduction

The collection and processing of electronic health records (EHR) have the potential to increase the quality of care and diagnostic value [1, 2]. EHR may include, for instance, medical history, medication, demographics, or other personal or lifestyle meta-information. Questionnaires or surveys are one way to gather information about lifestyle choices that can complement other EHR data. As the adoption of EHR grows, suitable data mining methods are needed to analyze EHR data and uncover different patient subgroups or phenotypes [3]. By design, questionnaires typically include aspects that are assumed or known to be risk factors. The incidence of a disease can be compared by conducting hypothesis tests between predefined groups, for instance, between smokers and non-smokers. However, in order to uncover previously unknown patient phenotypes, unsupervised multivariate approaches are needed. Low-rank matrix factorizations, such as principal component analysis (PCA) [4, 5] or nonnegative matrix factorization (NMF) [6], are promising tools for analyzing multivariate data and revealing underlying patterns in an unsupervised way. These approaches have the advantage of not presuming any kind of groups; thus, they may allow the discovery of patient phenotypes and co-factors, i.e., features that co-occur.

However, missing data and the different statistical data types of the feature columns are challenging problems when analyzing heterogeneous (questionnaire) data. Generalized low-rank models (GLRM) provide a promising, recently developed framework to address these challenges [7, 8]. In this context, generalization stands for the extension of losses beyond the standard quadratic loss. A GLRM approximates a heterogeneous data matrix using low-rank score and loading matrices while taking into account the statistical data type of each column. We investigate this idea and explore whether it benefits computational phenotyping compared to an NMF-based model that is agnostic to data types.

1.1 Cervical Cancer Screening Programme

Since the establishment of a coordinated nationwide cervical cancer screening programme in Norway in 1995, the incidence of the disease has been substantially reduced [9]. In addition to collecting the screening results, the Cancer Registry of Norway sent out a questionnaire to roughly 30,000 women in 2004–2005 and 2011–2012 [10, 11]. It comprises questions about lifestyle choices such as drinking and smoking habits, as well as questions about contraception usage, sexual activity (e.g., number of sexual partners) and previous history of sexually transmitted diseases (STDs), among others. Together with the screening results from cytology examinations, this data set can provide researchers as well as medical practitioners with valuable insights about demographics, disease progression and patient phenotypes. The complete screening history of a woman f can be denoted by \(\{(s_i,\, d_i) \}_{i=1}^{n_f}\), where \(s_i\) is the age at the i-th screening, \(d_i\) is the associated examination result, encoded by diagnosis codes (see Table 1, Appendix), and \(n_f\) is the total number of screenings for f. The (cytological or histological) examination results range from no atypical cells to different categorizations of pre-cancers and cancers. While the screening data is a population-level data set, the questionnaire data covers only a sub-population.

1.2 Uncovering Phenotypes and Co-Factors is Ongoing Research

While it is known that the human papillomavirus (HPV) causes nearly all cervical cancer cases, the different risk factors for such an infection and their interactions with each other remain a relevant research topic. Previous studies and reviews have identified various factors that increase the risk of cervical cancer, e.g., the duration of hormonal contraception [12] or marital status [13]. Early age at first intercourse as well as early pregnancies have been determined to be risk factors in developing countries [14]. A further study has proposed a model according to which the incidence rate of cervical cancer is proportional to the square of the time since first intercourse [15]. Some factors, such as smoking, have been identified as co-factors, meaning that they increase the cervical cancer risk among HPV-positive women [16]. In order to reveal these statistical associations, studies typically use uni- or bivariate tests [17]. However, to uncover more complex phenotypes, multivariate approaches are needed.

In this study, we use GLRMs to analyze a large-scale medical questionnaire data set linked with screening data, and show that GLRMs are a viable method for phenotype discovery in the context of cervical cancer risk groups. We demonstrate that when GLRMs are used to analyze the questionnaire data in the form of a (female participants × features) matrix, meaningful phenotypes showing statistically significant differences between risk-level subgroups are revealed. One phenotype, for instance, is characterized by the number of sexual partners as well as hormonal contraception usage. Some extracted phenotypes are consistent across models using different numbers of components. Grouping women based on a phenotype description can potentially be used in the future to personalize cervical cancer screening programmes. The ultimate goal is to avoid both too infrequent screening and over-screening. While low-rank models have been used previously for phenotyping EHR data [1, 2, 18, 19], primarily focusing on the analysis of medication, procedure and diagnosis data, the multivariate analysis of self-reported medical questionnaires to reveal phenotypes remains an under-researched and challenging problem.

This study, to the best of our knowledge, presents the first attempt to discover phenotypes from survey data that was collected within a cervical cancer screening programme, using NMF as well as a low-rank model with data-type-specific loss functions.

2 Materials and Methods

2.1 Questionnaire Description and Preprocessing

The aspects covered in the questionnaire can be roughly grouped into nine categories: contraception, awareness of HPV, smoking, drinking habits, sexual activity, pregnancies, previous STDs, and other personal information such as marital status and education. The answers to these questions have different statistical data types. A question of Boolean type, for example, asks whether a person smokes, while a further question asks for the age at which the person started (or stopped) smoking. In addition to this categorization of feature columns according to their statistical data type, the features can also be categorized according to their static or dynamic nature. Static features, once reported, do not change over time (e.g., whether hormonal contraception was ever used before), while dynamic features (e.g., the number of years of smoking) are time-dependent. To a certain extent, the design of the questionnaire makes it possible to associate the questionnaire features with screening results.

By recording the starting age of a certain habit or the onset of a certain kind of contraception use, the time since the starting age can be computed at a certain later screening time point \(s_i\).

For each screening \(s_i\), a subset of the questionnaire features are transformed such that they denote durations or “time since onset”. These features are also called delta-time features, and the prefix dt_ is used to denote them. Delta features allow examination results \(d_1,\dots , d_{n_f}\) to be associated with questionnaire feature vectors.

Transformed features can only be computed if the starting point for a certain habit lies in the past, given a certain screening time point \(s_i\). Questionnaire feature rows that do not fulfill this condition are discarded.
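As a concrete sketch, the delta-time transformation and the associated row filtering could look as follows (the column names and data layout here are hypothetical, not the actual questionnaire schema):

```python
import numpy as np
import pandas as pd

def add_delta_time_features(df, screening_age_col="screening_age",
                            onset_cols=("age_started_smoking",
                                        "age_first_hormonal_contr")):
    """Turn onset-age columns into 'time since onset' (dt_) features,
    evaluated at the screening age s_i. Rows where the onset lies after
    the screening age are discarded, as described in the text.
    (Column names are illustrative, not the real questionnaire items.)"""
    out = df.copy()
    keep = pd.Series(True, index=out.index)
    for col in onset_cols:
        dt = out[screening_age_col] - out[col]
        out["dt_" + col] = dt
        # onset in the future relative to the screening -> drop the row;
        # missing onsets are kept as missing values
        keep &= dt.isna() | (dt >= 0)
    return out[keep]
```
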

To arrive at the final questionnaire data, the feature vector corresponding to the worst screening result (diagnosis codes in ascending order, cf. Table 1 in the Appendix) for each female participant is extracted. Rows and feature columns that contained more than \(50\%\) missing values were discarded. For example, questions about different STDs (e.g., chlamydia, gonorrhea) were only answered by relatively few women. The final features included in the analysis are shown in Table 3, Appendix. Screening results are heavily skewed towards normals. In order to prevent any low-rank model from primarily modeling the normal group, only a randomly sampled subset of normals is used. The distribution of risk-level categories in the final women (\(n=6359\)) by questionnaire items/features (\(p=29\)) matrix is shown in Table 1, Appendix.
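A minimal sketch of this final preprocessing step, assuming a long-format table with one row per screening (the severity mapping and column names are illustrative; the real ordering follows Table 1 in the Appendix):

```python
import numpy as np
import pandas as pd

# illustrative severity ordering; the actual mapping follows the
# diagnosis codes in Table 1 of the Appendix
SEVERITY = {"normal": 0, "ASCUS": 1, "LSIL": 2, "ASC-H": 3,
            "HSIL": 4, "cancer": 5}

def build_final_matrix(df, id_col="woman_id", dx_col="diagnosis",
                       max_missing=0.5):
    """Keep, per woman, the feature row of her worst screening result,
    then drop rows and columns with more than 50% missing values."""
    sev = df[dx_col].map(SEVERITY)
    worst = df.loc[sev.groupby(df[id_col]).idxmax()]
    # rows first, then columns, exceeding the missing-value threshold
    worst = worst[worst.isna().mean(axis=1) <= max_missing]
    worst = worst.loc[:, worst.isna().mean() <= max_missing]
    return worst
```
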

2.2 Generalized Low-Rank Models

Notation: Scalars are denoted by lowercase letters, vectors by boldface lowercase letters, and matrices by boldface uppercase letters. By \(x_{ij}\) we denote the (i, j) entry of a matrix \(\textbf{X}\). We use \(\textbf{x}_{i:}\) to denote the ith row and \(\textbf{x}_{:j}\) the jth column of an \(n \times p\) matrix \(\textbf{X}\). We treat both \(\textbf{x}_{i:}\) and \(\textbf{x}_{:j}\) as column vectors.

We use generalized low-rank models to approximate the heterogeneous survey data matrix \(\textbf{Q}\in \mathbb {R}^{n \times p}\) using a low-rank female-mode matrix \(\textbf{X}\in \mathbb {R}^{n \times k}\) and a phenotype matrix \(\textbf{Y}\in \mathbb {R}^{k \times p}\) with k factors, where k is typically much smaller than \(\min (n,p)\). In contrast to the heterogeneous data matrix \(\textbf{Q}\), the factor matrices \(\textbf{X}\) and \(\textbf{Y}\) are real-valued. The factor matrices are computed by solving the following optimization problem:

$$\begin{aligned}&\!\min _{\textbf{X}, \textbf{Y}}&\sum _{(i,j) \in \varOmega } \mathcal {L}_{j}(q_{ij}, \textbf{x}_{i:}^\top \textbf{y}_{:j}) / \sigma _j^2 + \lambda _r \mathcal {R}_r(\textbf{X}) + \lambda _c \mathcal {R}_c(\textbf{Y}) \\&\text {s.t.}&\textbf{X}\ge 0, \, \textbf{Y}\ge 0, \nonumber \end{aligned}$$
(1)

where \(\varOmega \) is the set of observed entries, \(\mathcal {L}_j : (\mathbb {R} \times \mathbb {R}) \rightarrow \mathbb {R}\) denotes the entry-wise loss function that depends on the statistical data type of the respective column in \(\textbf{Q}\), and \(\textbf{X}\ge 0\) indicates that all matrix entries are nonnegative. To balance the unequal scaling across different columns, the loss of each column is divided by \(\sigma _{j}^{2} =\frac{1}{n_{j}-1} \sum _{i:(i, j) \in \varOmega } \mathcal L_{j}\left( \mu _{j}, q_{i j}\right) \), where \( \mu _{j} ={{\text {argmin}}}_{\mu } \sum _{i:(i, j) \in \varOmega } \mathcal L_{j}\left( \mu , q_{i j}\right) \) and \(n_j\) denotes the number of non-missing entries in column j; this quantity is a loss-dependent generalization of the variance. Scaling is therefore not a preprocessing step: in order to scale the columns, a small optimization problem must first be solved to obtain the \(\{\mu _{j}\}_{j=1}^{p}\), which are then used to compute \(\{\sigma ^2_{j}\}_{j=1}^{p}\). The \(\{\mu _{j}\}_{j=1}^{p}\) themselves are not used in the optimization problem (1), i.e., the columns are only scaled, not centered. \(\mathcal {R}_r(\textbf{X}) = \sum _{i=1}^{n} r_i(\textbf{x}_{i:})\) and \(\mathcal {R}_c(\textbf{Y}) = \sum _{j=1}^{p} r_j(\textbf{y}_{:j})\) denote regularization terms across the rows of \(\textbf{X}\) and the columns of \(\textbf{Y}\), denoted by the subscripts r and c, respectively. We use the \(\ell _1\)-norm, i.e., \(r_i(\textbf{x}_{i:}) = || \textbf{x}_{i:} ||_1 = \sum _{j=1}^k \left| x_{ij}\right| \), to enforce sparsity across the rows of \(\textbf{X}\) and the columns of \(\textbf{Y}\). The reasons for using sparsity are two-fold: sparsity encourages clustering [20, 21] and, together with nonnegativity, yields a less arbitrary, more well-posed solution of the optimization problem above. In general, low-rank models are non-convex, and missing data exacerbate the non-convexity, leading to more local minima [22].
Note that the formulation above does not incorporate a weight matrix. Instead, the set \(\varOmega \) contains indices of all available data in \(\textbf{Q}\). An equivalent formulation is to use a binary weight matrix that encodes missing and non-missing data.
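The loss-dependent scaling can be sketched numerically: for the quadratic loss, \(\mu_j\) and \(\sigma_j^2\) reduce to the column mean and (sample) variance, while for other losses a small scalar optimization is solved per column. The loss functions below are generic stand-ins, not the paper's exact losses from Table 2 of the Appendix:

```python
import numpy as np
from scipy.optimize import minimize_scalar

def column_scale(col, loss):
    """Generalized variance for one column (observed entries only):
    mu_j  = argmin_mu  sum_i loss(mu, q_ij)
    s2_j  = sum_i loss(mu_j, q_ij) / (n_j - 1)
    `loss(u, a)` compares a real-valued estimate u with a data value a."""
    q = col[~np.isnan(col)]
    n_j = len(q)
    mu = minimize_scalar(lambda u: sum(loss(u, a) for a in q)).x
    sigma2 = sum(loss(mu, a) for a in q) / (n_j - 1)
    return mu, sigma2

quadratic = lambda u, a: (u - a) ** 2
# logistic loss for Boolean data coded as a in {-1, +1}
logistic = lambda u, a: np.log1p(np.exp(-a * u))
```

For the quadratic loss this recovers the usual mean/variance pair, which is a quick sanity check on the definition.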

Low-rank approximations have been extended beyond the minimization of the quadratic loss in the past, e.g., to model Poisson- or Bernoulli-distributed data [23]. The framework used in this study, however, facilitates both the use of different loss functions and the imposition of constraints on the factors through regularization. Constraints play a crucial role in matrix factorizations since they are often needed to reveal unique patterns (that can be further interpreted as, e.g., phenotypes or biomarkers). The framework has been used before to investigate autism spectrum disorder phenotypes using hospitalization records [7].

3 Experiments

We assess the performance of a GLRM-based model in terms of revealing phenotypes from the questionnaire data matrix \(\textbf{Q}\). Our results demonstrate that GLRM can reveal phenotypes showing statistically significant differences between cervical cancer risk groups. We also show that both GLRM and an NMF-based model find similar general risk factors using a 4-component model. However, when a higher number of components is used to reveal more phenotypes, GLRM uncovers more phenotypes that are both statistically significant and consistent.

3.1 Implementation Details and Experimental Set-Up

In order to solve the optimization problem given in (1), we use the Julia package LowRankModels.jl, which fits low-rank models using an alternating proximal gradient method [8]. We extended this framework to fit our needs; for instance, we implemented a Kullback-Leibler divergence loss function \(\mathcal {L}_{\text {KL}}\) for count data (cf. Table 2 in the Appendix). To avoid local minima, we use 50 random initializations and keep the solution with the minimum loss. We also validate the uniqueness of \(\textbf{X}\) and \(\textbf{Y}\) experimentally by assessing solutions from multiple runs, making sure that the factor matrices corresponding to the minimum function values are (visually) the same.
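For count data, one common convention for such a loss is the generalized Kullback-Leibler divergence; a sketch is given below (the exact form used in Table 2 of the Appendix may differ):

```python
import numpy as np

def kl_loss(q, u, eps=1e-9):
    """Generalized KL divergence between a count q >= 0 and a
    nonnegative model estimate u:  q*log(q/u) - q + u.
    It is nonnegative and minimized (value 0) at u = q."""
    u = max(u, eps)          # guard against log(q/0)
    if q == 0:
        return u
    return q * np.log(q / u) - q + u
```
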

In this study, two types of models are used: one defined by the optimization problem (1) with data-type-dependent loss functions \(\mathcal L_j\), and a naïve counterpart that uses the same constraints and regularization but a quadratic loss across all feature columns; the latter is thus a nonnegative matrix factorization with additional \(\ell _1\) regularization. In the following, we use the abbreviation GLRM for the tailored model with statistical data-type-dependent loss functions, and NMF for the nonnegative matrix factorization model with \(\ell _1\) regularization. We explored different regularization parameters for the sparsity regularization, i.e., \(\lambda _r, \lambda _c \in \{0.1,1,5,10\}\), and observed that \(\lambda _r = \lambda _c = 1\) yields sparse and significant phenotypes. Increasing the regularization parameters further yielded phenotypes that were sparser but with fewer significant subgroups.

3.2 Model Selection

One way to determine the appropriate number of components for each model is to use the imputation error, which also allows us to compare different models [8]. For each rank-\(k\) model with \(k \in \{1,\dots , 16\}\), 25 different sets of held-out values are sampled, holding out \(15\%\) of the values in each \(\textbf{Q}^{\text {miss}}_i\). By computing the corresponding GLRM and NMF models for each of the \(\{\, \textbf{Q}^{\text {miss}}_{i} \, \}_{i=1}^{25}\), the held-out values are estimated and reconstruction error statistics are computed. Both the median of the imputation error and its whole spread need to be taken into consideration. These statistics show the generalization performance and can be used to select a model. Refer to [8] for more information about how to compute imputation errors for mixed statistical data types.
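The hold-out procedure can be sketched as follows; the toy masked factorization below (quadratic loss, nonnegative factors, projected gradient steps) merely stands in for the actual GLRM/NMF fits done with LowRankModels.jl:

```python
import numpy as np

def fit_masked_nmf(Q, mask, k, n_iter=1000, seed=0):
    """Toy masked NMF: minimize the quadratic loss over the entries
    with mask == 1, with nonnegative factors, by projected alternating
    gradient steps (step size 1/Lipschitz constant of each subproblem)."""
    rng = np.random.default_rng(seed)
    n, p = Q.shape
    X, Y = rng.random((n, k)), rng.random((k, p))
    Q0 = np.nan_to_num(Q)
    for _ in range(n_iter):
        step = 1.0 / (np.linalg.norm(Y @ Y.T, 2) + 1e-9)
        X = np.clip(X - step * (mask * (X @ Y - Q0)) @ Y.T, 0.0, None)
        step = 1.0 / (np.linalg.norm(X.T @ X, 2) + 1e-9)
        Y = np.clip(Y - step * X.T @ (mask * (X @ Y - Q0)), 0.0, None)
    return X, Y

def imputation_error(Q, k, holdout_frac=0.15, seed=0):
    """Hold out a fraction of the observed entries, refit on the rest,
    and evaluate the (here: quadratic) error on the held-out values."""
    rng = np.random.default_rng(seed)
    observed = ~np.isnan(Q)
    holdout = observed & (rng.random(Q.shape) < holdout_frac)
    train_mask = (observed & ~holdout).astype(float)
    X, Y = fit_masked_nmf(Q, train_mask, k)
    return np.mean(((X @ Y)[holdout] - Q[holdout]) ** 2)
```

Repeating this for 25 independent hold-out sets per rank yields the error statistics (median and spread) used for model selection.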

Fig. 1. Imputation error statistics for \(k \in \{1,\dots , 16\}\). The imputation error is mean-normalized within each feature and by the number of held-out values.

Prior to building the final models, outliers are removed via the leverage score [24], given by \(\textbf{h} = {\text {diag}}(\textbf{X}\left( \textbf{X}^{\top } \textbf{X}\right) ^{-1} \textbf{X}^{\top })\), using the corresponding score matrices from the best-performing models in terms of the imputation error. Data points with a leverage score above the \(99\%\) quantile were removed (fewer than 50 subjects for both NMF and GLRM). The model selection process was repeated after the outliers were discarded.
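A sketch of this outlier-removal step (the \(99\%\) quantile follows the text; computing the diagonal via a linear solve rather than an explicit inverse is an implementation choice):

```python
import numpy as np

def leverage_scores(X):
    """Leverage scores h = diag(X (X^T X)^{-1} X^T) of the score matrix X,
    using a linear solve for numerical stability. For full-column-rank X,
    each h_i lies in (0, 1] and sum(h) equals the number of columns."""
    return np.einsum("ij,ji->i", X, np.linalg.solve(X.T @ X, X.T))

def flag_outliers(X, quantile=0.99):
    """Flag rows whose leverage score exceeds the given quantile."""
    h = leverage_scores(X)
    return h > np.quantile(h, quantile)
```
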

3.3 General Cervical Cancer Risk Factors

For a first exploratory analysis, we investigate the imputation errors of GLRM and NMF in order to perform the model selection procedure described above. Figure 1 shows the imputation errors for \(k \in \{1,\dots , 16\}\). NMF models outperform GLRM for \(k \in \{1,\dots , 9\}\). Beyond this range, the imputation error of NMF varies strongly while that of GLRM remains stable. In the range \(k \in \{2,\dots , 9\}\), the imputation error of both NMF and GLRM changes little. We pick \(k=4\) since both models achieve nearly their smallest error for this rank. For a 4-component model, GLRM and NMF are close with respect to their imputation error, and there are some similarities in their latent features.

Figure 2 shows the corresponding score matrices \(\textbf{X}_{\text {nmf}}\) and \(\textbf{X}_{\text {glrm}}\), as well as the feature matrices \(\textbf{Y}_{\text {nmf}}\) and \(\textbf{Y}_{\text {glrm}}\), for a 4-component model. The scores are arranged according to the risk groups, which are indicated by colors (green for normal, yellow for low-grade, red for high-grade, gray for cancer). Furthermore, the higher-risk groups in the figures are over-represented (cf. Table 1, Appendix) in order to compensate for the skewness of the risk-group distribution. The horizontal line within each risk group shows the mean.

Fig. 2. Left side of each plot shows a subsample of \(\textbf{X}_{\text {nmf}}\) and \(\textbf{X}_{\text {glrm}}\), respectively. The right side shows the latent features, \(\textbf{Y}_{\text {nmf}}\) and \(\textbf{Y}_{\text {glrm}}\). All factor matrices \(\textbf{X}\), \(\textbf{Y}\) are normalized by the norm of their columns and rows, respectively. \(c_1\), ..., \(c_4\) denote components. Colors indicate corresponding diagnosis groups: green: normal, yellow: low risk, red: high-risk, gray: cancer. Blue bars for \(\textbf{Y}_{\text {nmf}}\) and \(\textbf{Y}_{\text {glrm}}\) indicate that the differences between normals and all other risk groups are significant, while gray bars indicate that there is at least one subgroup that is non-significant. See Table 3 for a description of the features. (Color figure online)

Interested in whether there is a difference between the diagnosis groups, especially between normals and low-grade/high-grade risk groups, we perform unpaired t-tests for each risk group within each component. For the first component, for instance, we perform a t-test between normals vs. ASCUS (low-grade), normals vs. LSIL (low-grade), normals vs. ASC-H (high-grade), and so on. In this way, we can determine the components that capture meaningful subgroups on the basis of which different risk groups might be separated. In this study, we focus only on the components that show statistically significant group differences between normals and all other groups. In Fig. 2, statistical significance is indicated by gray or blue colored bars for \(\textbf{Y}\). Blue bars indicate that the differences between normals and all other risk groups for the corresponding component are all statistically significant, i.e., for all six t-tests we found a p-value \(\le 0.05/b_k\), where \(b_k=6k\) is a Bonferroni correction that is applied for each \(k\)-component model and takes into account all significance tests performed. Components that exhibit significant differences for each of the six tests are called significant components in the following. Gray bars in Fig. 2 indicate that there is at least one risk group within the component with a non-significant result.
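The testing scheme can be sketched as follows (the group labels and data here are illustrative; the paper's six comparison groups follow Table 1 in the Appendix):

```python
import numpy as np
from scipy.stats import ttest_ind

def significant_components(scores, groups, normal_label="normal",
                           alpha=0.05):
    """Unpaired t-tests of normals vs. every other diagnosis group, per
    component; a component counts as significant only if all its tests
    pass the Bonferroni-adjusted threshold alpha / (total number of
    tests over the whole k-component model)."""
    groups = np.asarray(groups)
    k = scores.shape[1]
    others = sorted(set(groups) - {normal_label})
    b = k * len(others)                 # all tests performed for this model
    normal = groups == normal_label
    significant = []
    for c in range(k):
        pvals = [ttest_ind(scores[normal, c], scores[groups == g, c]).pvalue
                 for g in others]
        significant.append(all(p <= alpha / b for p in pvals))
    return significant
```
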

There are phenotypes that reflect higher-risk groups. Consider, for instance, the fourth component of the GLRM model, \(c^{\, \text {glrm}}_4\): there are recognizably lower values for the normal diagnosis group (green) compared with the higher-risk groups (yellow, red, gray). The phenotype is mostly characterized by hormonal contraception usage, which is known to be a risk factor. Thus, it can be assumed that this component models a general risk group, meaning that the latent feature space hints at risk factors. For each GLRM component, there exists one (arguably sufficiently similar) corresponding NMF component. For instance, \(c^{\, \text {glrm}}_1\) corresponds to \(c^{\, \text {nmf}}_1\) and shows a phenotype mainly defined by the features age_partner and age. For GLRM, the hormonal contraception subgroup (\(c^{\, \text {glrm}}_4\)) shows significance for all pairwise t-tests, while this is not the case for the corresponding NMF subgroup. In summary, GLRM uncovers one more significant subgroup than NMF. Perhaps surprisingly, a simple NMF model together with \(\ell _1\) regularization can find very similar subgroups.

3.4 Phenotypes for Higher Number of Components

Increasing the number of components and inspecting the corresponding models beyond what is shown in Fig. 2 might reveal other subgroups of interest. Investigating higher ranks is necessary because there are, by design, already (at least) nine categories of questions in the questionnaire. As described earlier, these are related to contraception, awareness of HPV, smoking, drinking habits, sexual activity, pregnancies, previous STDs, and other personal information such as marital status and education. Only a model with higher rank can extract or separate these subgroups, especially as phenotypes might also be characterized by a combination of features from different categories. While the imputation error is stable for GLRM for higher ranks \(k \in \{8,\dots , 16\}\), it increases for NMF. Several models with different numbers of components are considered in order to assess the sensitivity of the model to the number of components and the consistency of the components interpreted as phenotypes. We inspect the models for \(k \in \{ 7,8,9,10 \}\) (see Figs. 3 and 4), which use \(\{c^{\, \text {glrm}}_1,\dots ,c^{\, \text {glrm}}_{10} \}\) to denote the different components. Note that components from different models were grouped together based on cosine similarity. This means that, for instance, \(c^{\, \text {glrm}}_3\) only contains components from the models with \(k=9\) and \(k=10\), while a corresponding component for \(k=8\) and \(k=7\) does not exist. Thus, \(\{c^{\, \text {glrm}}_1,\dots ,c^{\, \text {glrm}}_{10} \}\) should be understood as names for different subgroups and not as an enumeration of components.
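The grouping of components across models of different rank can be sketched with a greedy cosine-similarity matching (the similarity threshold here is an illustrative choice, not a value from the paper):

```python
import numpy as np

def match_components(Y_a, Y_b, threshold=0.8):
    """Greedily pair components (rows of the phenotype matrices) of two
    models by cosine similarity; -1 marks an unmatched component, i.e.,
    a subgroup with no counterpart in the other model."""
    norm = lambda M: M / (np.linalg.norm(M, axis=1, keepdims=True) + 1e-12)
    sims = norm(Y_a) @ norm(Y_b).T
    match = np.full(len(Y_a), -1)
    taken = set()
    for i in np.argsort(-sims.max(axis=1)):     # most confident rows first
        for j in np.argsort(-sims[i]):
            if j not in taken and sims[i, j] >= threshold:
                match[i] = j
                taken.add(j)
                break
    return match
```
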

There are two important and general observations about the latent feature space. First, related features are grouped together within components. The features that are most distinct in component \(c^{\, \text {glrm}}_7\), for instance, are related to sexual habits. Second, many components are consistent across models with different numbers of components. We say that two or more components from different \(k\)-component models are consistent with respect to some subgroup if there is a consensus between their most important feature weights. In some cases, phenotypes are characterized by very few prevalent features that are related, e.g., the hormonal contraception/condom subgroup \(c^{\, \text {glrm}}_9\). An example of a phenotype that is consistent across all four models is age_partner \(+\) age (\(c^{\, \text {glrm}}_4\)). We use the label complex phenotype to denote a subgroup that is characterized by features from more than two categories.

Fig. 3. Normalized \(\textbf{Y}_{\text {glrm}}\) components \(\{c^{\, \text {glrm}}_1,\dots ,c^{\, \text {glrm}}_{10} \}\) for models with \(k \in \{7,8,9,10 \}\). Different colored bars indicate factors from the different models. Filled bars correspond to significance between all risk-levels for a certain subgroup.

Fig. 4. Normalized \(\textbf{Y}_{\text {nmf}}\) components \(\{c^{\, \text {nmf}}_1,\dots ,c^{\, \text {nmf}}_{10} \}\) for models with \(k \in \{7,8,9,10 \}\).

Besides showing the phenotypes, Figs. 3 and 4 also display the Bonferroni-adjusted statistically significant subgroups (i.e., \(p \le 0.05/b_k\)). Within each component, we indicate statistical significance by using either filled or unfilled bars: if every risk group (from low-risk to cancer) deviates significantly from the normal group, the corresponding bars are colored; otherwise, only the edges are shown.

Components that are consistently visible across different numbers of components k, and show statistically significant deviations between normals and every other risk group, provide strong evidence for a meaningful phenotype within the questionnaire data. Important phenotypes uncovered by GLRM are, for instance, \(c^{\, \text {glrm}}_{9}\) (hormonal contraception, condom, number of partners) or \(c^{\, \text {glrm}}_4\) (age of first sexual partner + age). Figure 4 shows the phenotypes for NMF.

4 Discussion

For a 4-component model, NMF and GLRM both uncover phenotypes related to hormonal contraception, age + age of first sexual partner, and a complex phenotype with a similar profile (with the exception of num_partners). The subsequent analysis using higher-rank models with \(k \in \{ 7,8,9,10 \}\) suggests that loss functions matched to the data types are better suited for phenotype discovery than a standard quadratic loss: GLRM uncovers more phenotypes than NMF. Furthermore, component \(c^{\, \text {glrm}}_{9}\) shows that GLRM is able to reveal a significant subgroup that is mainly defined by two binary variables: hormon_contr and condom. Some components show that related features are grouped together, e.g., \(c^{\, \text {glrm}}_3\) (contraception + sexual habits) or \(c^{\, \text {glrm}}_{9}\) (hormonal contraception, condom, number of partners).

Grouping of related features, consistency between different k-rank models, expert knowledge and significance between risk-levels provide evidence that (generalized) low-rank models can uncover important phenotypes. By design, the questionnaire mainly contains items that are known to be important risk factors. However, the results in this study show that significant components or subgroups that are defined by multivariate features exist. A subgroup that is found by both GLRM and NMF, as well as across different k-rank models within both models is the phenotype that is characterized by the age of the female participant as well as the age of the first sexual partner.

Some phenotypes that are defined by one or a few very dominant features align with the literature on cervical cancer risk factors. The usage of hormonal contraception (\(c^{\, \text {glrm}}_5\)), especially over long durations, is linked with an increased risk of cervical cancer [12, 25]. The number of sexual partners is another well-known and important risk factor [26,27,28] and is, for instance, reflected by component \(c^{\, \text {glrm}}_{9}\). Component \(c^{\, \text {glrm}}_{3}\), and especially component \(c^{\, \text {nmf}}_{3}\), groups the number of sexual partners and the history of genital warts together, an association that has been found previously [29]. Time since first intercourse [15] is a further contributing risk factor (\(c^{\, \text {glrm}}_{9}\)). Our analysis suggests that investigating models with more components uncovers important features and phenotypes that are not present in lower-rank models. For example, the binary feature hpv, which stands for knowledge about HPV, only appears in \(c^{\, \text {glrm}}_{10}\) in a pronounced way.

Using the score matrices \(\textbf{X}_{\text {nmf}}\) and \(\textbf{X}_{\text {glrm}}\) from all previously discussed models, we tried to find clusters, e.g., by using k-means clustering of all possible subspaces, defined by the columns of the score matrices. No distinct clusters were found that reflect the different risk-levels, which is probably due to the uniform effect: k-means clusters tend to have uniform sizes and hence cannot capture imbalanced risk-levels [30]. We assume that it is not possible to find distinct, non-overlapping clusters just based on questionnaire data, as the within-risk-level variation is too large. However, our results indicate that it is possible to uncover certain tendencies of risk-level groups.

5 Future Work

Validating phenotypes based on unpaired t-tests between risk-level groups is a limitation, as differences in the means might constitute a necessary but not sufficient condition for the clinical meaningfulness of a phenotype. Testing the validity of phenotypes, i.e., their significance in a clinical context, is a challenge that might be adequately addressed by methods from survival analysis [1, 31], in which the time until an event of interest occurs is studied. In our context, this time span could be defined as the time between the completion of the questionnaire and a high-grade risk result. Different phenotypes can then be evaluated with respect to their event-time distributions, which in turn can serve as a proxy for clinical significance. Figure 5 depicts an exemplary pipeline that uses a low-rank model to compute (sparse) phenotypes that are then examined by survival analysis. Such a pipeline could uncover the most important phenotypes and questions, and could be beneficial for personalizing cervical cancer screening programmes, striking a better balance between too infrequent screening and over-screening.
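As a starting point, such a survival analysis could compare Kaplan-Meier curves between phenotype groups; a minimal sketch is given below (a full analysis would rather use a dedicated survival library and Cox-type models):

```python
import numpy as np

def kaplan_meier(times, events):
    """Kaplan-Meier survival curve for one phenotype group.
    times:  time from questionnaire completion to a high-grade result
            (or to censoring);
    events: 1 if the high-grade result was observed, 0 if censored.
    Returns a list of (event time, survival probability) pairs."""
    t = np.asarray(times, dtype=float)
    e = np.asarray(events, dtype=int)
    curve, s = [], 1.0
    for u in np.unique(t[e == 1]):
        at_risk = np.sum(t >= u)                 # still under observation
        d = np.sum((t == u) & (e == 1))          # events at this time
        s *= 1.0 - d / at_risk
        curve.append((u, s))
    return curve
```

Comparing such curves (e.g., with a log-rank test) between women grouped by their dominant phenotype would give the clinically oriented validation discussed above.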

Fig. 5. Pipeline from a low-rank model to personalized screening.

6 Conclusion

In this study, (generalized) low-rank models were used for computational phenotype discovery in questionnaires that were sent out to gather meta-information within the Norwegian cervical cancer screening programme. We used two decomposition methods: one that is agnostic to the different data types and one that accounts for the different statistical data types via appropriate loss functions. Our results indicate that the careful construction of models tailored to the data types was worthwhile and revealed more significant phenotypes than the naïve counterpart. Discovering clinically meaningful phenotypes helps to identify risk groups that are characterized by a combination of features. Phenotypes related to the age of the first sexual partner, hormonal contraception, the number of sexual partners and contraception usage, among others, were identified in the Norwegian questionnaire data.