
1 Introduction

The collection and processing of electronic health records (EHR) have the potential to increase the quality of care and diagnostic value [1, 2]. EHR may include, for instance, medical history, medication, demographics, or other personal or lifestyle meta-information. Questionnaires or surveys are one way to gather information about lifestyle choices that can complement other EHR data. As the adoption of EHR grows, suitable data mining methods are needed to analyze EHR data and uncover different patient subgroups or phenotypes [3]. By design, questionnaires typically include aspects that are assumed or known to be risk factors. The incidence of a disease can be compared by conducting hypothesis tests between predefined groups, for instance, between smokers and non-smokers. However, in order to uncover previously unknown patient phenotypes, unsupervised multivariate approaches are needed. Low-rank matrix factorizations, such as principal component analysis (PCA) [4, 5] or nonnegative matrix factorization (NMF) [6], are promising tools for analyzing multivariate data and revealing underlying patterns in an unsupervised way. These approaches have the advantage of not presuming any kind of groups; thus, they may allow the discovery of patient phenotypes and co-factors, i.e., features that co-occur.

However, missing data and the different statistical data types of the feature columns are challenging problems when analyzing heterogeneous (questionnaire) data. Generalized low-rank models (GLRM) provide a promising, recently developed framework to address these challenges [7, 8]. In this context, generalization stands for the extension of losses beyond the standard quadratic loss. A GLRM approximates a heterogeneous data matrix using low-rank score and loading matrices while taking into account the statistical data type of each column. We investigate this idea and explore whether it benefits computational phenotyping compared to an NMF-based model that is agnostic to data types.

1.1 Cervical Cancer Screening Programme

Since the establishment of a coordinated nationwide cervical cancer screening programme in Norway in 1995, the incidence of the disease has been substantially reduced [9]. In addition to collecting the screening results, the Cancer Registry of Norway sent out a questionnaire to roughly 30,000 women in 2004–2005 and 2011–2012 [10, 11]. It comprises questions about lifestyle choices such as drinking and smoking habits, as well as questions about contraception usage, sexual activity (e.g., number of sexual partners) and previous history of sexually transmitted diseases (STDs), among others. Together with the screening results from cytology examinations, this data set can provide researchers as well as medical practitioners with valuable insights about demographics, disease progression and patient phenotypes. The complete screening history of a woman f can be denoted by \(\{(s_i,\, d_i) \}_{i=1}^{n_f}\), where \(s_i\) is the age at the i-th screening, \(d_i\) is the associated examination result, encoded by diagnosis codes (see Table 1, Appendix), and \(n_f\) is the total number of screenings for f. The (cytological or histological) examination results range from no atypical cells to different categorizations of pre-cancers and cancers. While the screening data is a population-level data set, the questionnaire data covers only a sub-population.

1.2 Uncovering Phenotypes and Co-Factors is Ongoing Research

While it is known that the human papillomavirus (HPV) causes nearly all cervical cancer cases, the different risk factors for such an infection and their interactions with each other remain a relevant research topic. Previous studies and reviews have identified various factors that increase the risk of cervical cancer, e.g., the duration of hormonal contraception [12] or marital status [13]. Early age at first intercourse as well as early pregnancies have been determined to be risk factors in developing countries [14]. A further study has proposed a model according to which the incidence rate of cervical cancer is proportional to the square of the time since first intercourse [15]. Some factors, such as smoking, have been identified as co-factors, meaning that they increase the cervical cancer risk among HPV-positive women [16]. In order to reveal these statistical associations, studies typically use uni- or bivariate tests [17]. However, to uncover more complex phenotypes, multivariate approaches are needed.

In this study, we use GLRMs to analyze a large-scale medical questionnaire data set linked with screening data, and show that GLRMs are a viable method for phenotype discovery in the context of cervical cancer risk groups. We demonstrate that when GLRMs are used to analyze the questionnaire data in the form of a (female participants × features) matrix, meaningful phenotypes showing statistically significant differences between risk-level subgroups are revealed. One phenotype, for instance, is characterized by the number of sexual partners as well as hormonal contraception usage. Some extracted phenotypes are consistent across models using different numbers of components. Grouping women based on a phenotype description can potentially be used in the future to personalize cervical cancer screening programmes. The ultimate goal is to avoid both too infrequent screening and over-screening. While low-rank models have been used previously for phenotyping EHR data [1, 2, 18, 19], primarily focusing on the analysis of medication, procedure and diagnosis data, the multivariate analysis of self-reported medical questionnaires to reveal phenotypes remains an under-researched and challenging problem.

This study, to the best of our knowledge, presents the first attempt to discover phenotypes from survey data that was collected within a cervical cancer screening programme, using NMF as well as a low-rank model with data-type-specific loss functions.

2 Materials and Methods

2.1 Questionnaire Description and Preprocessing

The aspects covered in the questionnaire can be roughly grouped into nine categories: contraception, awareness of HPV, smoking, drinking habits, sexual activity, pregnancies, previous STDs, and other personal information such as marital status and education. The answers to these questions have different statistical data types. A question of Boolean type, for example, asks whether a person smokes, while a further question asks for the age at which the person started (or stopped) smoking. In addition to this categorization of feature columns according to their statistical data type, the features can also be categorized according to their static or dynamic nature. Static features, once reported, do not change over time (e.g., whether hormonal contraception was ever used before), while dynamic features (e.g., the number of years of smoking) are time-dependent. To a certain extent, the design of the questionnaire makes it possible to associate the questionnaire features with screening results.

By recording the starting age of a certain habit or the onset of a certain kind of contraception use, the time since the starting age can be computed at a certain later screening time point \(s_i\).

For each screening \(s_i\), a subset of the questionnaire features are transformed such that they denote durations or “time since onset”. These features are also called delta-time features, and the prefix dt_ is used to denote them. Delta features allow examination results \(d_1,\dots , d_{n_f}\) to be associated with questionnaire feature vectors.

Transformed features can only be computed if the starting point for a certain habit lies in the past, given a certain screening time point \(s_i\). Questionnaire feature rows that do not fulfill this condition are discarded.
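As a concrete sketch, the delta-time transformation and the associated row filtering could look as follows (the column names and data layout here are hypothetical, not the actual questionnaire schema):

```python
import numpy as np
import pandas as pd

def add_delta_time_features(df, screening_age_col="screening_age",
                            onset_cols=("age_started_smoking",
                                        "age_first_hormonal_contr")):
    """Turn onset-age columns into 'time since onset' (dt_) features,
    evaluated at the screening age s_i. Rows where the onset lies after
    the screening age are discarded, as described in the text.
    (Column names are illustrative, not the real questionnaire items.)"""
    out = df.copy()
    keep = pd.Series(True, index=out.index)
    for col in onset_cols:
        dt = out[screening_age_col] - out[col]
        out["dt_" + col] = dt
        # onset in the future relative to the screening -> drop the row;
        # missing onsets are kept as missing values
        keep &= dt.isna() | (dt >= 0)
    return out[keep]
```
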

To arrive at the final questionnaire data, the feature vector corresponding to the worst screening result (diagnosis codes in ascending order, cf. Table 1 in the Appendix) for each female participant is extracted. Rows and feature columns that contained more than \(50\%\) missing values were discarded. For example, questions about different STDs (e.g., chlamydia, gonorrhea) were only answered by relatively few women. The final features included in the analysis are shown in Table 3, Appendix. Screening results are heavily skewed towards normals. In order to prevent any low-rank model from primarily modeling the normal group, only a randomly sampled subset of normals is used. The distribution of risk-level categories in the final women (\(n=6359\)) by questionnaire items/features (\(p=29\)) matrix is shown in Table 1, Appendix.
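A minimal sketch of this final preprocessing step, assuming a long-format table with one row per screening (the severity mapping and column names are illustrative; the real ordering follows Table 1 in the Appendix):

```python
import numpy as np
import pandas as pd

# illustrative severity ordering; the actual mapping follows the
# diagnosis codes in Table 1 of the Appendix
SEVERITY = {"normal": 0, "ASCUS": 1, "LSIL": 2, "ASC-H": 3,
            "HSIL": 4, "cancer": 5}

def build_final_matrix(df, id_col="woman_id", dx_col="diagnosis",
                       max_missing=0.5):
    """Keep, per woman, the feature row of her worst screening result,
    then drop rows and columns with more than 50% missing values."""
    sev = df[dx_col].map(SEVERITY)
    worst = df.loc[sev.groupby(df[id_col]).idxmax()]
    # rows first, then columns, exceeding the missing-value threshold
    worst = worst[worst.isna().mean(axis=1) <= max_missing]
    worst = worst.loc[:, worst.isna().mean() <= max_missing]
    return worst
```
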

2.2 Generalized Low-Rank Models

Notation: Scalars are denoted by lowercase letters, vectors by boldface lowercase letters, and matrices by boldface uppercase letters. By \(x_{ij}\) we denote the (i, j) entry of a matrix \(\textbf{X}\). We use \(\textbf{x}_{i:}\) to denote the ith row and \(\textbf{x}_{:j}\) the jth column of an \(n \times p\) matrix \(\textbf{X}\). We treat both \(\textbf{x}_{i:}\) and \(\textbf{x}_{:j}\) as column vectors.

We use generalized low-rank models to approximate the heterogeneous survey data matrix \(\textbf{Q}\in \mathbb {R}^{n \times p}\) using a low-rank female-mode matrix \(\textbf{X}\in \mathbb {R}^{n \times k}\) and a phenotype matrix \(\textbf{Y}\in \mathbb {R}^{k \times p}\) with k factors, where k is typically much smaller than \(\min (n,p)\). In contrast to the heterogeneous data matrix \(\textbf{Q}\), the factor matrices \(\textbf{X}\) and \(\textbf{Y}\) are real-valued. The factor matrices are computed by solving the following optimization problem:

$$\begin{aligned}&\!\min _{\textbf{X}, \textbf{Y}}&\sum _{(i,j) \in \varOmega } \mathcal {L}_{j}(q_{ij}, \textbf{x}_{i:}^\top \textbf{y}_{:j}) / \sigma _j^2 + \lambda _r \mathcal {R}_r(\textbf{X}) + \lambda _c \mathcal {R}_c(\textbf{Y}) \\&\text {s.t.}&\textbf{X}\ge 0, \, \textbf{Y}\ge 0, \nonumber \end{aligned}$$
(1)

where \(\varOmega \) is the set of observed entries, \(\mathcal {L}_j : (\mathbb {R} \times \mathbb {R}) \rightarrow \mathbb {R}\) denotes the entry-wise loss function that depends on the statistical data type of the respective column in \(\textbf{Q}\), and \(\textbf{X}\ge 0\) indicates that all matrix entries are nonnegative. To balance the unequal scaling across different columns, the loss of each column is divided by \(\sigma _{j}^{2} =\frac{1}{n_{j}-1} \sum _{i:(i, j) \in \varOmega } \mathcal L_{j}\left( \mu _{j}, q_{i j}\right) \), where \( \mu _{j} ={{\text {argmin}}}_{\mu } \sum _{i:(i, j) \in \varOmega } \mathcal L_{j}\left( \mu , q_{i j}\right) \) and \(n_j\) denotes the number of non-missing entries in column j; this quantity is a loss-dependent generalization of the variance. Scaling is therefore not a preprocessing step: in order to scale the columns, a small optimization problem must first be solved to obtain the \(\{\mu _{j}\}_{j=1}^{p}\), which are then used to compute \(\{\sigma ^2_{j}\}_{j=1}^{p}\). The \(\{\mu _{j}\}_{j=1}^{p}\) themselves are not used in the optimization problem (1), i.e., the columns are only scaled, not centered. \(\mathcal {R}_r(\textbf{X}) = \sum _{i=1}^{n} r_i(\textbf{x}_{i:})\) and \(\mathcal {R}_c(\textbf{Y}) = \sum _{j=1}^{p} r_j(\textbf{y}_{:j})\) denote regularization terms across the rows of \(\textbf{X}\) and the columns of \(\textbf{Y}\), denoted by the subscripts r and c, respectively. We use the \(\ell _1\)-norm, i.e., \(r_i(\textbf{x}_{i:}) = || \textbf{x}_{i:} ||_1 = \sum _{j=1}^k \left| x_{ij}\right| \), to enforce sparsity across the rows of \(\textbf{X}\) and the columns of \(\textbf{Y}\). The reasons for using sparsity are two-fold: sparsity encourages clustering [20, 21] and, together with nonnegativity, yields a less arbitrary, more well-posed solution of the optimization problem above. In general, low-rank models are non-convex, and missing data exacerbate the non-convexity, leading to more local minima [22].
Note that the formulation above does not incorporate a weight matrix. Instead, the set \(\varOmega \) contains indices of all available data in \(\textbf{Q}\). An equivalent formulation is to use a binary weight matrix that encodes missing and non-missing data.
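The loss-dependent scaling can be sketched numerically: for the quadratic loss, \(\mu_j\) and \(\sigma_j^2\) reduce to the column mean and (sample) variance, while for other losses a small scalar optimization is solved per column. The loss functions below are generic stand-ins, not the paper's exact losses from Table 2 of the Appendix:

```python
import numpy as np
from scipy.optimize import minimize_scalar

def column_scale(col, loss):
    """Generalized variance for one column (observed entries only):
    mu_j  = argmin_mu  sum_i loss(mu, q_ij)
    s2_j  = sum_i loss(mu_j, q_ij) / (n_j - 1)
    `loss(u, a)` compares a real-valued estimate u with a data value a."""
    q = col[~np.isnan(col)]
    n_j = len(q)
    mu = minimize_scalar(lambda u: sum(loss(u, a) for a in q)).x
    sigma2 = sum(loss(mu, a) for a in q) / (n_j - 1)
    return mu, sigma2

quadratic = lambda u, a: (u - a) ** 2
# logistic loss for Boolean data coded as a in {-1, +1}
logistic = lambda u, a: np.log1p(np.exp(-a * u))
```

For the quadratic loss this recovers the usual mean/variance pair, which is a quick sanity check on the definition.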

Low-rank approximations have been extended beyond the minimization of the quadratic loss in the past, e.g., to model Poisson- or Bernoulli-distributed data [23]. The framework used in this study, however, facilitates both the use of different loss functions and the imposition of constraints on the factors through regularization. Constraints play a crucial role in matrix factorizations since they are often needed to reveal unique patterns (that can be further interpreted as, e.g., phenotypes or biomarkers). The framework has been used before to investigate autism spectrum disorder phenotypes using hospitalization records [7].

3 Experiments

We assess the performance of a GLRM-based model in terms of revealing phenotypes from the questionnaire data matrix \(\textbf{Q}\). Our results demonstrate that GLRM can reveal phenotypes showing statistically significant differences between cervical cancer risk groups. We also show that both GLRM and an NMF-based model find similar general risk factors using a 4-component model. However, when a higher number of components is used to reveal more phenotypes, GLRM uncovers more phenotypes that are both statistically significant and consistent.

3.1 Implementation Details and Experimental Set-Up

In order to solve the optimization problem given in (1), we use the Julia package LowRankModels.jl, which fits low-rank models using an alternating proximal gradient method [8]. We extended this framework to fit our needs; for instance, we implemented a Kullback-Leibler divergence loss function \(\mathcal {L}_{\text {KL}}\) for count data (cf. Table 2 in the Appendix). To avoid local minima, we use 50 random initializations and keep the solution with the minimum loss. We also validate the uniqueness of \(\textbf{X}\) and \(\textbf{Y}\) experimentally by assessing solutions from multiple runs, making sure that the factor matrices corresponding to the minimum function values are (visually) the same.
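For count data, one common convention for such a loss is the generalized Kullback-Leibler divergence; a sketch is given below (the exact form used in Table 2 of the Appendix may differ):

```python
import numpy as np

def kl_loss(q, u, eps=1e-9):
    """Generalized KL divergence between a count q >= 0 and a
    nonnegative model estimate u:  q*log(q/u) - q + u.
    It is nonnegative and minimized (value 0) at u = q."""
    u = max(u, eps)          # guard against log(q/0)
    if q == 0:
        return u
    return q * np.log(q / u) - q + u
```
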

In this study, two types of models are used: one defined by the optimization problem (1) with data-type-dependent loss functions \(\mathcal L_j\), and a naïve counterpart that uses the same constraints and regularization but a quadratic loss across all feature columns; the latter is thus a nonnegative matrix factorization with additional \(\ell _1\) regularization. In the following, we use the abbreviation GLRM for the tailored model with statistical data-type-dependent loss functions, and NMF for the nonnegative matrix factorization model with \(\ell _1\) regularization. We explored different regularization parameters for the sparsity regularization, i.e., \(\lambda _r, \lambda _c \in \{0.1,1,5,10\}\), and observed that \(\lambda _r = \lambda _c = 1\) yields sparse and significant phenotypes. Increasing the regularization parameters further yielded phenotypes that were sparser but with fewer significant subgroups.

3.2 Model Selection

One way to determine the appropriate number of components for each model is to use the imputation error, which also allows us to compare different models [8]. For each rank-\(k\) model with \(k \in \{1,\dots , 16\}\), 25 different sets of held-out values are sampled, holding out \(15\%\) of the values in each \(\textbf{Q}^{\text {miss}}_i\). By computing the corresponding GLRM and NMF models for each of the \(\{\, \textbf{Q}^{\text {miss}}_{i} \, \}_{i=1}^{25}\), the held-out values are estimated and reconstruction error statistics are computed. Both the median of the imputation error and its whole spread need to be taken into consideration. These statistics show the generalization performance and can be used to select a model. Refer to [8] for more information about how to compute imputation errors for mixed statistical data types.
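The hold-out procedure can be sketched as follows; the toy masked factorization below (quadratic loss, nonnegative factors, projected gradient steps) merely stands in for the actual GLRM/NMF fits done with LowRankModels.jl:

```python
import numpy as np

def fit_masked_nmf(Q, mask, k, n_iter=1000, seed=0):
    """Toy masked NMF: minimize the quadratic loss over the entries
    with mask == 1, with nonnegative factors, by projected alternating
    gradient steps (step size 1/Lipschitz constant of each subproblem)."""
    rng = np.random.default_rng(seed)
    n, p = Q.shape
    X, Y = rng.random((n, k)), rng.random((k, p))
    Q0 = np.nan_to_num(Q)
    for _ in range(n_iter):
        step = 1.0 / (np.linalg.norm(Y @ Y.T, 2) + 1e-9)
        X = np.clip(X - step * (mask * (X @ Y - Q0)) @ Y.T, 0.0, None)
        step = 1.0 / (np.linalg.norm(X.T @ X, 2) + 1e-9)
        Y = np.clip(Y - step * X.T @ (mask * (X @ Y - Q0)), 0.0, None)
    return X, Y

def imputation_error(Q, k, holdout_frac=0.15, seed=0):
    """Hold out a fraction of the observed entries, refit on the rest,
    and evaluate the (here: quadratic) error on the held-out values."""
    rng = np.random.default_rng(seed)
    observed = ~np.isnan(Q)
    holdout = observed & (rng.random(Q.shape) < holdout_frac)
    train_mask = (observed & ~holdout).astype(float)
    X, Y = fit_masked_nmf(Q, train_mask, k)
    return np.mean(((X @ Y)[holdout] - Q[holdout]) ** 2)
```

Repeating this for 25 independent hold-out sets per rank yields the error statistics (median and spread) used for model selection.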

Fig. 1. Imputation error statistics for \(k \in \{1,\dots , 16\}\). The imputation error is mean-normalized within each feature and by the number of held-out values.

Prior to building the final models, outliers are removed via the leverage score [24], given by \(\textbf{h} = {\text {diag}}(\textbf{X}\left( \textbf{X}^{\top } \textbf{X}\right) ^{-1} \textbf{X}^{\top })\), using the corresponding score matrices from the best-performing models in terms of the imputation error. Data points with a leverage score above the \(99\%\) quantile were removed (fewer than 50 subjects for both NMF and GLRM). The model selection process was repeated after the outliers were discarded.
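A sketch of this outlier-removal step (the \(99\%\) quantile follows the text; computing the diagonal via a linear solve rather than an explicit inverse is an implementation choice):

```python
import numpy as np

def leverage_scores(X):
    """Leverage scores h = diag(X (X^T X)^{-1} X^T) of the score matrix X,
    using a linear solve for numerical stability. For full-column-rank X,
    each h_i lies in (0, 1] and sum(h) equals the number of columns."""
    return np.einsum("ij,ji->i", X, np.linalg.solve(X.T @ X, X.T))

def flag_outliers(X, quantile=0.99):
    """Flag rows whose leverage score exceeds the given quantile."""
    h = leverage_scores(X)
    return h > np.quantile(h, quantile)
```
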

3.3 General Cervical Cancer Risk Factors

For a first exploratory analysis, we investigate the imputation errors of GLRM and NMF in order to perform the model selection procedure described above. Figure 1 shows the imputation errors for \(k \in \{1,\dots , 16\}\). NMF models outperform GLRM for \(k \in \{1,\dots , 9\}\). Beyond this range, the imputation error of NMF varies strongly while that of GLRM remains stable. In the range \(k \in \{2,\dots , 9\}\), the imputation error of both NMF and GLRM changes little. We pick \(k=4\) since both models achieve nearly their smallest error for this rank. For a 4-component model, GLRM and NMF are close with respect to their imputation error, and there are some similarities in their latent features.

Figure 2 shows the corresponding score matrices \(\textbf{X}_{\text {nmf}}\) and \(\textbf{X}_{\text {glrm}}\), as well as the feature matrices \(\textbf{Y}_{\text {nmf}}\) and \(\textbf{Y}_{\text {glrm}}\), for a 4-component model. The scores are arranged according to the risk groups, which are indicated by colors (green for normal, yellow for low-grade, red for high-grade, gray for cancer). Furthermore, the higher-risk groups in the figures are over-represented (cf. Table 1, Appendix) in order to compensate for the skewness of the risk-group distribution. The horizontal line within each risk group shows the mean.

Fig. 2. Left side of each plot shows a subsample of \(\textbf{X}_{\text {nmf}}\) and \(\textbf{X}_{\text {glrm}}\), respectively. The right side shows the latent features, \(\textbf{Y}_{\text {nmf}}\) and \(\textbf{Y}_{\text {glrm}}\). All factor matrices \(\textbf{X}\), \(\textbf{Y}\) are normalized by the norm of their columns and rows, respectively. \(c_1\), ..., \(c_4\) denote components. Colors indicate corresponding diagnosis groups: green: normal, yellow: low risk, red: high-risk, gray: cancer. Blue bars for \(\textbf{Y}_{\text {nmf}}\) and \(\textbf{Y}_{\text {glrm}}\) indicate that the differences between normals and all other risk groups are significant, while gray bars indicate that there is at least one subgroup that is non-significant. See Table 3 for a description of the features. (Color figure online)

Interested in whether there is a difference between the diagnosis groups, especially between normals and low-grade/high-grade risk groups, we perform unpaired t-tests for each risk group within each component. For the first component, for instance, we perform a t-test between normals vs. ASCUS (low-grade), normals vs. LSIL (low-grade), normals vs. ASC-H (high-grade), and so on. In this way, we can determine the components that capture meaningful subgroups on the basis of which different risk groups might be separated. In this study, we focus only on the components that show statistically significant group differences between normals and all other groups. In Fig. 2, statistical significance is indicated by gray or blue colored bars for \(\textbf{Y}\). Blue bars indicate that the differences between normals and all other risk groups for the corresponding component are all statistically significant, i.e., for all six t-tests we found a p-value \(\le 0.05/b_k\), where \(b_k=6k\) is a Bonferroni correction that is applied for each \(k\)-component model and takes into account all significance tests performed. Components that exhibit significant differences for each of the six tests are called significant components in the following. Gray bars in Fig. 2 indicate that there is at least one risk group within the component with a non-significant result.
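The testing scheme can be sketched as follows (the group labels and data here are illustrative; the paper's six comparison groups follow Table 1 in the Appendix):

```python
import numpy as np
from scipy.stats import ttest_ind

def significant_components(scores, groups, normal_label="normal",
                           alpha=0.05):
    """Unpaired t-tests of normals vs. every other diagnosis group, per
    component; a component counts as significant only if all its tests
    pass the Bonferroni-adjusted threshold alpha / (total number of
    tests over the whole k-component model)."""
    groups = np.asarray(groups)
    k = scores.shape[1]
    others = sorted(set(groups) - {normal_label})
    b = k * len(others)                 # all tests performed for this model
    normal = groups == normal_label
    significant = []
    for c in range(k):
        pvals = [ttest_ind(scores[normal, c], scores[groups == g, c]).pvalue
                 for g in others]
        significant.append(all(p <= alpha / b for p in pvals))
    return significant
```
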

There are phenotypes that reflect higher-risk groups. Consider, for instance, the fourth component of the GLRM model, \(c^{\, \text {glrm}}_4\): there are recognizably lower values for the normal diagnosis group (green) compared with the higher-risk groups (yellow, red, gray). The phenotype is mostly characterized by hormonal contraception usage, which is known to be a risk factor. Thus, it can be assumed that this component models a general risk group, meaning that the latent feature space hints at risk factors. For each GLRM component, there exists one (arguably sufficiently similar) corresponding NMF component. For instance, \(c^{\, \text {glrm}}_1\) corresponds to \(c^{\, \text {nmf}}_1\) and shows a phenotype mainly defined by the features age_partner and age. For GLRM, the hormonal contraception subgroup (\(c^{\, \text {glrm}}_4\)) shows significance for all pairwise t-tests, while this is not the case for the corresponding NMF subgroup. In summary, GLRM uncovers one more significant subgroup than NMF. Perhaps surprisingly, a simple NMF model together with \(\ell _1\) regularization can find very similar subgroups.

3.4 Phenotypes for Higher Number of Components

Increasing the number of components and inspecting the corresponding models beyond what is shown in Fig. 2 might reveal other subgroups of interest. Investigating higher ranks is necessary because there are, by design, already (at least) nine categories of questions in the questionnaire. As described earlier, these are related to contraception, awareness of HPV, smoking, drinking habits, sexual activity, pregnancies, previous STDs, and other personal information such as marital status and education. Only a model with higher rank can extract or separate these subgroups, especially as phenotypes might also be characterized by a combination of features from different categories. While the imputation error is stable for GLRM for higher ranks \(k \in \{8,\dots , 16\}\), it increases for NMF. Several models with different numbers of components are considered in order to assess the sensitivity of the model to the number of components and the consistency of the components interpreted as phenotypes. We inspect the models for \(k \in \{ 7,8,9,10 \}\) (see Figs. 3 and 4), which use \(\{c^{\, \text {glrm}}_1,\dots ,c^{\, \text {glrm}}_{10} \}\) to denote the different components. Note that components from different models were grouped together based on cosine similarity. This means that, for instance, \(c^{\, \text {glrm}}_3\) only contains components from the models with \(k=9\) and \(k=10\), while a corresponding component for \(k=8\) and \(k=7\) does not exist. Thus, \(\{c^{\, \text {glrm}}_1,\dots ,c^{\, \text {glrm}}_{10} \}\) should be understood as names for different subgroups and not as an enumeration of components.
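The grouping of components across models of different rank can be sketched with a greedy cosine-similarity matching (the similarity threshold here is an illustrative choice, not a value from the paper):

```python
import numpy as np

def match_components(Y_a, Y_b, threshold=0.8):
    """Greedily pair components (rows of the phenotype matrices) of two
    models by cosine similarity; -1 marks an unmatched component, i.e.,
    a subgroup with no counterpart in the other model."""
    norm = lambda M: M / (np.linalg.norm(M, axis=1, keepdims=True) + 1e-12)
    sims = norm(Y_a) @ norm(Y_b).T
    match = np.full(len(Y_a), -1)
    taken = set()
    for i in np.argsort(-sims.max(axis=1)):     # most confident rows first
        for j in np.argsort(-sims[i]):
            if j not in taken and sims[i, j] >= threshold:
                match[i] = j
                taken.add(j)
                break
    return match
```
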

There are two important and general observations about the latent feature space. First, related features are grouped together within components. The features that are most distinct in component \(c^{\, \text {glrm}}_7\), for instance, are related to sexual habits. Second, many components are consistent across models with different numbers of components. We say that two or more components from different \(k\)-component models are consistent with respect to some subgroup if there is a consensus between their most important feature weights. In some cases, phenotypes are characterized by very few prevalent features that are related, e.g., the hormonal contraception/condom subgroup \(c^{\, \text {glrm}}_9\). An example of a phenotype that is consistent across all four models is age_partner \(+\) age (\(c^{\, \text {glrm}}_4\)). We use the label complex phenotype to denote a subgroup that is characterized by features from more than two categories.

Fig. 3. Normalized \(\textbf{Y}_{\text {glrm}}\) components \(\{c^{\, \text {glrm}}_1,\dots ,c^{\, \text {glrm}}_{10} \}\) for models with \(k \in \{7,8,9,10 \}\). Different colored bars indicate factors from the different models. Filled bars correspond to significance between all risk-levels for a certain subgroup.

Fig. 4. Normalized \(\textbf{Y}_{\text {nmf}}\) components \(\{c^{\, \text {nmf}}_1,\dots ,c^{\, \text {nmf}}_{10} \}\) for models with \(k \in \{7,8,9,10 \}\).

Besides showing the phenotypes, Figs. 3 and 4 also display the Bonferroni-adjusted statistically significant subgroups (i.e., \(p \le 0.05/b_k\)). Within each component, we indicate statistical significance by using either filled or unfilled bars: if every risk group (from low-risk to cancer) deviates significantly from the normal group, the corresponding bars are colored; otherwise, only the edges are shown.

Components that are consistently visible across different numbers of components k, and show statistically significant deviations between normals and every other risk group, provide strong evidence for a meaningful phenotype within the questionnaire data. Important phenotypes uncovered by GLRM are, for instance, \(c^{\, \text {glrm}}_{9}\) (hormonal contraception, condom, number of partners) or \(c^{\, \text {glrm}}_4\) (age of first sexual partner + age). Figure 4 shows the phenotypes for NMF.

4 Discussion

For a 4-component model, NMF and GLRM both uncover phenotypes related to hormonal contraception, age + age of first sexual partner, and a complex phenotype with a similar profile (with the exception of num_partners). The subsequent analysis using higher-rank models with \(k \in \{ 7,8,9,10 \}\) suggests that loss functions matched to the data types are better suited for phenotype discovery than a standard quadratic loss: GLRM uncovers more phenotypes than NMF. Furthermore, component \(c^{\, \text {glrm}}_{9}\) shows that GLRM is able to reveal a significant subgroup that is mainly defined by two binary variables: hormon_contr and condom. Some components show that related features are grouped together, e.g., \(c^{\, \text {glrm}}_3\) (contraception + sexual habits) or \(c^{\, \text {glrm}}_{9}\) (hormonal contraception, condom, number of partners).

Grouping of related features, consistency between different k-rank models, expert knowledge and significance between risk-levels provide evidence that (generalized) low-rank models can uncover important phenotypes. By design, the questionnaire mainly contains items that are known to be important risk factors. However, the results in this study show that significant components or subgroups that are defined by multivariate features exist. A subgroup that is found by both GLRM and NMF, as well as across different k-rank models within both models is the phenotype that is characterized by the age of the female participant as well as the age of the first sexual partner.

Some phenotypes that are defined by one or a few very dominant features align with the literature on cervical cancer risk factors. The usage of hormonal contraception (\(c^{\, \text {glrm}}_5\)), especially over long durations, is linked with an increased risk of cervical cancer [12, 25]. The number of sexual partners is another well-known and important risk factor [26,27,28] and is, for instance, reflected by component \(c^{\, \text {glrm}}_{9}\). Component \(c^{\, \text {glrm}}_{3}\), and especially component \(c^{\, \text {nmf}}_{3}\), groups the number of sexual partners and the history of genital warts together, an association that has been found previously [29]. Time since first intercourse [15] is a further contributing risk factor (\(c^{\, \text {glrm}}_{9}\)). Our analysis suggests that investigating models with more components uncovers important features and phenotypes that are not present in lower-rank models. For example, the binary feature hpv, which stands for knowledge about HPV, only appears in \(c^{\, \text {glrm}}_{10}\) in a pronounced way.

Using the score matrices \(\textbf{X}_{\text {nmf}}\) and \(\textbf{X}_{\text {glrm}}\) from all previously discussed models, we tried to find clusters, e.g., by using k-means clustering of all possible subspaces, defined by the columns of the score matrices. No distinct clusters were found that reflect the different risk-levels, which is probably due to the uniform effect: k-means clusters tend to have uniform sizes and hence cannot capture imbalanced risk-levels [30]. We assume that it is not possible to find distinct, non-overlapping clusters just based on questionnaire data, as the within-risk-level variation is too large. However, our results indicate that it is possible to uncover certain tendencies of risk-level groups.

5 Future Work

Validating phenotypes based on unpaired t-tests between risk-level groups is a limitation, as differences in the means might constitute a necessary but not sufficient condition for the clinical meaningfulness of a phenotype. Testing the validity of phenotypes, i.e., their significance in a clinical context, is a challenge that might be adequately addressed by methods from survival analysis [1, 31], in which the time until an event of interest occurs is studied. In our context, this time span could be defined as the time between the completion of the questionnaire and a high-grade risk result. Different phenotypes can then be evaluated with respect to their event-time distributions, which in turn can serve as a proxy for clinical significance. Figure 5 depicts an exemplary pipeline that uses a low-rank model to compute (sparse) phenotypes that are then examined by survival analysis. Such a pipeline could uncover the most important phenotypes and questions, and could be beneficial for personalizing cervical cancer screening programmes, striking a better balance between too infrequent screening and over-screening.
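As a starting point, such a survival analysis could compare Kaplan-Meier curves between phenotype groups; a minimal sketch is given below (a full analysis would rather use a dedicated survival library and Cox-type models):

```python
import numpy as np

def kaplan_meier(times, events):
    """Kaplan-Meier survival curve for one phenotype group.
    times:  time from questionnaire completion to a high-grade result
            (or to censoring);
    events: 1 if the high-grade result was observed, 0 if censored.
    Returns a list of (event time, survival probability) pairs."""
    t = np.asarray(times, dtype=float)
    e = np.asarray(events, dtype=int)
    curve, s = [], 1.0
    for u in np.unique(t[e == 1]):
        at_risk = np.sum(t >= u)                 # still under observation
        d = np.sum((t == u) & (e == 1))          # events at this time
        s *= 1.0 - d / at_risk
        curve.append((u, s))
    return curve
```

Comparing such curves (e.g., with a log-rank test) between women grouped by their dominant phenotype would give the clinically oriented validation discussed above.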

Fig. 5. Pipeline from a low-rank model to personalized screening.

6 Conclusion

In this study, (generalized) low-rank models were used for computational phenotype discovery in questionnaires that were sent out to gather meta-information within the Norwegian cervical cancer screening programme. We used two decomposition methods: one that is agnostic to the different data types and one that accounts for the different statistical data types via appropriate loss functions. Our results indicate that the careful construction of models tailored to the data types was worthwhile and revealed more significant phenotypes than the naïve counterpart. Discovering clinically meaningful phenotypes helps to identify risk groups that are characterized by a combination of features. Phenotypes related to the age of the first sexual partner, hormonal contraception, the number of sexual partners and contraception usage, among others, were identified in the Norwegian questionnaire data.