1 Introduction

In visually oriented specialized medical domains such as dermatology and radiology, physicians explore interesting image cases from medical image repositories for comparative case studies to aid clinical diagnoses, educate medical trainees, and support medical research. This image browsing and lookup could benefit from a grouping of medical images that is consistent with experts' understanding of the image content. However, this is challenging because medical image interpretation usually requires domain knowledge that tends to be tacit. Therefore, to make this expertise more explicit, we propose an interactive machine learning paradigm that keeps experts in the loop to improve image grouping.

Fig. 1 Overview of the flow chart of our expert-in-the-loop paradigm. An expert encodes domain knowledge as special constraints through rounds of interactions

1.1 Contributions

The contributions of this paper span several areas, including interaction-based visual analytics, knowledge discovery through interactions, user modeling during interactions, and human-centered computing.

  • Interaction-based visual analytics Existing systems that support interactive visual analysis usually adopt topic modeling techniques [26]. Original features are reduced to a lower-dimensional topic space, in which documents are grouped. One class of such systems, including UTOPIAN [3] and iVisClustering [6], visualizes the topics so that users can adjust the topic-term distribution at the term granularity. In contrast, our paradigm focuses experts on natural, high-level image grouping tasks and encodes their image manipulations as constraints to improve the overall image grouping. Moreover, in our domain the objects experts interact with are medical images rather than latent topics, which may be confusing to them. Another class of systems, including LSAView [4] and iVisClassifier [2], involves document-level interactions; these systems require users to change the parameters of the algorithms. In contrast, our system updates the underlying topic model based on experts' natural manipulations of the images.

  • Knowledge discovery through interactions Many existing visual analytics applications aim at data exploration and summarization [7, 9], where the visualized data clusters can be easily interpreted. For knowledge discovery purposes, there are also applications in domains such as geography [10], whose outcomes are likewise straightforward. Our paradigm operates in the medical domain, where image understanding and interpretation are difficult. We elicit and use the knowledge and expertise of the medical end users through interactions.

  • User modeling during interactions Similar to the ReGroup system, which interactively tailors its suggestions [11], our paradigm allows the model and the user to learn from each other. As an expert user interacts more with the learning algorithm, the underlying model gradually adapts to her mental model (how she groups the images and what her standards are) and records her personalized considerations during the task. Having different users in the loop results in different outputs. We seamlessly integrate machine learning with an adaptive user interaction mechanism to collect the most useful information, which compensates for the limited data.

  • Human-centered computing The loop requires both the computational strength of machine learning algorithms and the domain knowledge of the experts. The experts are given high-level, natural tasks, and local changes made by the experts can cause global updates of the underlying model. The global constraint ensures that the learned hidden topics can best recover the observed data points, whereas the local constraints come from the expert input. The balance is achieved through the interactive process, when the machine and the expert finally agree on each other's decisions.

Fig. 2 Image features filtered by an expert's eye gaze. (Image courtesy of Logical Images, Inc.). a A sample image, b SIFT features, c eye gaze, d filtered SIFT

1.2 Roadmap

In order to minimize human effort and provide experts with a good starting point for grouping images, we create an initial image grouping using a multimodal expert dataset described in Sect. 2 [12]. This initial grouping is learned through a multimodal data fusion algorithm that is flexible enough to incorporate new images [13]. From here, the loop to improve image grouping begins (see Fig. 1). An expert can inspect the image grouping and choose to improve it through an interface. Specifically, she encodes her domain knowledge about the medical images by grouping a small subset of images. Our learning algorithm automatically incorporates these manually specified connections as constraints for reorganizing the whole image dataset. The rules by which the interface parses expert inputs as implicit constraints are described in Sect. 5. The incrementally reorganized image set is presented using the visualization techniques in Sect. 4. In this way, the computational evolution of an image grouping model, its visualization, and expert interactions form a loop to improve image grouping. We presented preliminary results of this work at a conference [1]. This journal paper includes extensions, such as additional ways of constructing the neighboring matrix W, a new way of updating the underlying model, a more comprehensive evaluation, and discussions of the system's responsiveness to expert inputs, scalability, and other implementation considerations. The interface design and the supported expert image manipulations are presented in Sect. 3. An expert-in-the-loop evaluation study is described in Sect. 6.

2 Paradigm initialization

The initial image grouping was learned from an offline collected expert dataset. To elicit expert data, 16 physicians were asked to inspect 48 medical images and describe the image content aloud toward a diagnosis, as if teaching a student who was seated nearby [14]. Their eye movements were recorded, as eye movement features highlight perceptually important image regions, which is especially useful in knowledge-rich domains [15]. In this paper, we use experts’ eye fixation maps to filter image features (SIFT features [16]). See Fig. 2 for an example. A bag of visual words is created from the remaining image features, and each image is described by a histogram of the visual words. Physicians’ verbal image descriptions were also recorded concurrently, as they provide insights into experts’ diagnostic image understanding. Figure 3 shows a sample transcription. The medical concepts were extracted from the transcriptions using MetaMap, a medical language processing resource [17, 18]. These concepts formed a high-dimensional feature space, in which each image is described by the occurrences of these medical concepts.
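
As a concrete illustration of this filtering step, the following Python sketch keeps only the SIFT features that fall on fixated regions and quantizes the survivors against a visual-word vocabulary. The fixation-map format, the threshold value, and the `vocab` quantizer (e.g., a fitted scikit-learn KMeans over SIFT descriptors) are assumptions for illustration; the paper does not specify these implementation details.

```python
import cv2
import numpy as np

def gaze_filtered_bovw(image, fixation_map, vocab, threshold=0.2):
    """Bag-of-visual-words histogram from gaze-filtered SIFT features.
    `fixation_map` is assumed to be a float array in [0, 1] aligned with
    `image`; `threshold` is a hypothetical cutoff; `vocab` is a fitted
    clustering model (e.g., sklearn KMeans) over SIFT descriptors."""
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    keypoints, descriptors = cv2.SIFT_create().detectAndCompute(gray, None)
    # Keep only features at perceptually important (fixated) locations.
    keep = [i for i, kp in enumerate(keypoints)
            if fixation_map[int(kp.pt[1]), int(kp.pt[0])] > threshold]
    words = vocab.predict(descriptors[keep])
    hist = np.bincount(words, minlength=vocab.n_clusters).astype(float)
    return hist / max(hist.sum(), 1.0)  # normalized visual-word histogram
```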

Fig. 3 A sample diagnostic narrative expressed by a dermatologist inspecting the medical image shown in Fig. 2a. (SIL) represents silent pause

Fig. 4 Image grouping interface (details of the algorithms behind this interface are described in Sects. 4 and 5): Panel 1-a visualizes the image grouping before each round of expert image manipulation, and panel 1-b visualizes the resulting image grouping afterward. Experts can select multiple images in (1-a) for manipulation. Panels 2-a and 2-b are matrix views corresponding to (1-a) and (1-b), respectively, showing global pairwise image similarities. Button set 3 pops up new windows (shown in Figs. 5–8) that visualize image groupings initialized using various subsets of features, such as primary morphology terms (PRI); BOD stands for body parts, CD for correct diagnoses, and ET for eye gaze-filtered image features. Panel 4 allows experts to specify the direction in which to manipulate the selected images. Panel 5 lists the top key terms in each topic and allows experts to disconnect images from a topic; it can also show experts the top related images for key term manipulations

We choose a Laplacian sparse coding approach [19, 20] over latent semantic analysis (LSA) and latent Dirichlet allocation (LDA). LSA does not cluster images by object as well as Laplacian sparse coding [21]. LDA is affected not only by its initial specification but also by the samples randomly generated at each iteration [3]; because multiple runs yield inconsistent results, it does not support incremental changes by users.

To initialize an image grouping based on the features extracted from multiple modalities, we adopt a data fusion framework based on Laplacian sparse coding [13]. The objective function is presented in Eq. (1). Matrices \(E\in \mathbb {R}^{n_e\times m}\) and \(V\in \mathbb {R}^{n_v\times m}\) are eye gaze-filtered image features and verbal features, respectively (\(n_e\) being the number of visual words, \(n_v\) being the number of verbal features, and m being the number of images). This model provides flexibility to allow extra data modalities by adding terms like the first two in Eq. (1). The coefficient matrix \(C\in \mathbb {R}^{k\times m}\) (k being the number of latent topics) stores the new image representations, each of which is a distribution of latent topics learned and stored in the basis matrices \(P\in \mathbb {R}^{n_e\times k}\) and \(Q\in \mathbb {R}^{n_v\times k}\). The matrices P and Q reveal the transformation from the original feature spaces to latent topics.

$$\begin{aligned} \min _{P, Q, C \ge 0} \Vert E-{PC}\Vert _F^2+\Vert V-{QC}\Vert _F^2+\alpha \mathcal {G}(W, C)+\beta \mathcal {S}(C) \end{aligned}$$
(1)

where \(\mathcal {S}(\cdot )\) represents a sparsity constraint (\(l_1\)-norm) and \(\mathcal {G}(\cdot , \cdot )\) represents a graph regularizer. These constraints form the Laplacian sparse coding that helps capture the underlying semantics behind observations in both modalities [20, 22]. W is a neighboring matrix that indicates similarities between pairs of data instances. A multimodal variation of the feature-sign search algorithm is developed to selectively update some elements of each data instance, tackling the non-differentiability of the \(l_1\)-norm [22]. Since sparse codes learned through general-purpose machine learning algorithms usually do not reflect ideal expert image understanding in a specific domain [23], we extend this framework with extra constraints from expert knowledge to improve the semantic image representations.
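
To make the objective concrete, the following numpy sketch evaluates Eq. (1) for given factors. It assumes the common graph-Laplacian form \(\mathcal {G}(W, C)=\mathrm {tr}(CLC^\top )\) with \(L=D-W\) and an entrywise \(l_1\)-norm for \(\mathcal {S}(C)\); the actual optimizer (the multimodal feature-sign search of [22]) is not reproduced here.

```python
import numpy as np

def fusion_objective(E, V, P, Q, C, W, alpha, beta):
    """Value of Eq. (1), assuming G(W, C) = tr(C L C^T) with the graph
    Laplacian L = D - W, and S(C) = sum_ij |C_ij| (entrywise l1-norm)."""
    fit_e = np.linalg.norm(E - P @ C, 'fro') ** 2  # gaze-filtered image modality
    fit_v = np.linalg.norm(V - Q @ C, 'fro') ** 2  # verbal modality
    L = np.diag(W.sum(axis=1)) - W                 # graph Laplacian of W
    graph = np.trace(C @ L @ C.T)                  # keeps similar images close in topic space
    return fit_e + fit_v + alpha * graph + beta * np.abs(C).sum()
```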

3 Interface design

The initial image grouping, based purely on the offline collected expert data, is first visualized in the Older Image Organization panel in Fig. 4 (panel 1-a) for experts to inspect and manipulate. In case domain expert users need further information on the current image grouping, we provide two extra visualizations. First, experts can see an image cluster and the top features contributing to this cluster (see Fig. 9). Second, experts can click the buttons in Fig. 4 (panel 3) to compare the image grouping obtained using different subsets of features, such as only primary morphology terms (Fig. 5), with the grouping obtained using the whole feature set in Fig. 4 (panel 1-a).

Fig. 5 Image groupings generated using primary morphology terms (PRI)

Experts have a few options for improving the image grouping in each round. First, they can directly drag images toward or away from each other in Fig. 4 (panel 1-a). The system parses such inputs and incorporates them when updating the neighboring graph-based regularizer (see Sect. 5.1). Second, experts can select a topic from the listbox in Fig. 4 (panel 5) and indicate the least relevant image(s) given the vocabulary distribution of the selected topic; the system then updates the image-topic distribution matrix (see Sect. 5.2). Likewise, experts can select a topic and indicate term(s), so that the system updates the topic-term distribution matrices (see Sect. 5.3). After experts interact with the interface using any option, the image grouping from the previous round is copied to Fig. 4 (panel 1-a), and the improved one is shown in Fig. 4 (panel 1-b). In each round, both image groupings are visualized following the approaches discussed in Sect. 4.

4 Visualizing image groups

To comprehensively visualize the image grouping, our interface presents both a graph view shown in Fig. 4 (panel 1) and a matrix view shown in Fig. 4 (panel 2). Both views are automatically updated during expert interactions.

In the graph view, we adopt the t-distributed stochastic neighbor embedding (t-SNE) algorithm [24]. It visualizes the high-dimensional structure of the image grouping in a 2D graph view better than other dimensionality reduction techniques, such as principal component analysis (PCA) [3]. We use a distance metaphor to convey to experts that more similar images are spatially closer. However, this metaphor does not proportionally reflect all pairwise image similarities in high-dimensional space, because no dimensionality reduction algorithm can retain the whole data structure. To tackle this issue, our interface allows experts to see an image and its high-dimensional close neighbors in the 2D visualization. The popup window visualizing these neighbors is illustrated in Fig. 9. Different neighbors may share different features with the selected image; we visualize these shared features to enhance expert users' overall understanding of the image repository.
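
A minimal sketch of this graph-view embedding with scikit-learn, projecting the m columns of the topic-coefficient matrix C into 2D; the random stand-in data, the perplexity value, and the cosine metric are illustrative assumptions, not settings taken from the paper.

```python
import numpy as np
from sklearn.manifold import TSNE

# C is the k x m topic-coefficient matrix; each column is one image's
# topic representation (random stand-in here for illustration).
C = np.random.rand(20, 48)                     # k = 20 topics, m = 48 images
# t-SNE embeds the m images into 2D for the graph view; perplexity and
# metric are hypothetical settings.
coords_2d = TSNE(n_components=2, perplexity=15,
                 metric='cosine', init='random').fit_transform(C.T)
```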

In order to help expert users perceive how the underlying algorithm uses the concepts and features, and how each subset of features contributes to the overall image grouping, we provide additional visualization functionality that lets them look into various grouping results, e.g., those based on primary morphology terms (PRI; Fig. 5), correct diagnosis terms (CD; Fig. 6), body part terms (BOD; Fig. 7), and image features filtered by eye-tracking features (ET; Fig. 8). The clusters resulting from face, leg, foot, etc. in Fig. 7 can be recognized easily. Clusters in Figs. 5, 6, 7, and 8 also make sense to experts. Since experts may have particular feature preferences when grouping the images, such feature-specific visualizations are a foundation for interactive feature selection algorithms. Further discussion appears in Sect. 6.

The interface also presents a matrix view that gives an overview of the pairwise image similarities, because it is impractical for experts to inspect the close neighbors of every image in the 2D graph view. See Fig. 10 for a magnified matrix view. The matrix view provides a global index of pairwise image similarities in the learned representation.
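
The similarity measure behind the matrix view is not named in the text; the sketch below assumes cosine similarity in the learned topic space.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

C = np.random.rand(20, 48)     # stand-in topic coefficients (k x m)
# m x m pairwise image similarities in the learned topic space, rendered
# as block intensities in the matrix view (darker = more similar).
S = cosine_similarity(C.T)
```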

Fig. 6 Image groupings generated using correct diagnosis terms (CD)

Fig. 7 Image groupings generated using body part terms (BOD)

Fig. 8 Image groupings generated using image features filtered by eye-tracking features (ET)

Fig. 9 An example visualization of an image and its high-dimensional close neighbors. The target image is shown in the upper left quarter, and its top 3 close neighbors in the learned topic space are visualized in the other quarters. The shared verbal features are ranked by term frequency, and the top ones are listed below each corresponding neighbor. The shared perceptually important image features are also ranked, and the top ones are marked in both the target image and its neighbors. The colors of the markers differentiate the image pairs. (Images courtesy of Logical Images, Inc.)

5 Expert knowledge constraints

Prior studies mainly allow two kinds of user interactions to help improve a learned model: document-level interactions [3, 6] and topic/cluster-level interactions [2, 4]. In our scenario of improving medical image grouping, the documents are images. We prefer image (document)-level interactions for two reasons. On the one hand, medical conditions are more intuitive to physicians in the form of images than as text. On the other hand, the topics we learned offline from the multimodal expert dataset are not easily visualized or interpreted by physicians. The interface therefore provides three functions for receiving expert inputs and updating the model, all supporting image-level interactions.

Fig. 10 An example of the matrix view. The intensity of each block represents the similarity between corresponding images. The darker the block is, the more similar the images are. For example, the similarity between the images on the right is indicated by the dark block circled in the matrix view on the left. (Image courtesy of Logical Images, Inc.)

5.1 Constraint on neighboring matrix, W

Let the images in the original feature space be denoted as \(\mathbf{x}_1\), ..., \(\mathbf{x}_m\). A nearest neighbor graph G with m vertices can be constructed. A heat kernel weighting can be used to compute the element \(W_{ij}\) in the neighboring matrix W of the graph G [25].

$$\begin{aligned} W_{ij}={\left\{ \begin{array}{ll} e^{-\frac{\Vert \mathbf{x}_i - \mathbf{x}_j\Vert }{\sigma }}, &{} \text {if } \mathbf{x}_i \text { and } \mathbf{x}_j \text { are close}\\ 0, &{} \text {otherwise} \end{array}\right. } \end{aligned}$$
(2)

where \(W_{ij}\) equals 1, if \(\mathbf{x}_i\) and \(\mathbf{x}_j\) are identical.

Similarly, other graph-weighting strategies can be adopted, such as the 0–1 weighting in Eq. (3) and the histogram intersection kernel weighting in Eq. (4), depending on the feature attributes. They can also be used together to achieve superior clustering results [26].

$$\begin{aligned} W_{ij}={\left\{ \begin{array}{ll} 1, &{} \text {if } \mathbf{x}_i \text { and } \mathbf{x}_j \text { are close}\\ 0, &{} \text {otherwise} \end{array}\right. } \end{aligned}$$
(3)
$$\begin{aligned} W_{ij}={\left\{ \begin{array}{ll} \sum _{d=1}^{D} \min (\mathbf{x}_{di}, \mathbf{x}_{dj}), &{} \text {if } \mathbf{x}_i \text { and } \mathbf{x}_j \text { are close } (d \text { indexes the feature dimensions})\\ 0, &{} \text {otherwise} \end{array}\right. } \end{aligned}$$
(4)
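
The following sketch builds W under all three weighting schemes, interpreting "close" as k-nearest neighbors; the neighborhood size and σ are hypothetical choices not specified in the text.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def neighboring_matrix(X, n_neighbors=5, scheme='heat', sigma=1.0):
    """W for a nearest-neighbor graph over images X (one row per image).
    'Close' is taken to mean k-nearest neighbors; n_neighbors and sigma
    are hypothetical settings."""
    m = X.shape[0]
    _, idx = NearestNeighbors(n_neighbors=n_neighbors + 1).fit(X).kneighbors(X)
    W = np.zeros((m, m))
    for i in range(m):
        for j in idx[i, 1:]:                      # skip idx[i, 0], the point itself
            if scheme == 'heat':                  # heat kernel, Eq. (2)
                W[i, j] = np.exp(-np.linalg.norm(X[i] - X[j]) / sigma)
            elif scheme == 'binary':              # 0-1 weighting, Eq. (3)
                W[i, j] = 1.0
            else:                                 # histogram intersection, Eq. (4)
                W[i, j] = np.minimum(X[i], X[j]).sum()
            W[j, i] = W[i, j]                     # keep the graph symmetric
    return W
```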

The interface encodes expert image manipulations as a transformation of the neighboring matrix W. This transformation is determined by multiple factors, including the previous image grouping and the expert's interpretation of it. The transformation of W can be written as \(\mathcal {F}(\cdot , \cdot )\) in Eq. (5) and considered a constraint set by experts to guide the learning process.

$$\begin{aligned} \min _{P, Q, C \ge 0} \Vert E-{PC}\Vert _F^2+\Vert V-{QC}\Vert _F^2+\alpha \mathcal {G}(\mathcal {F}(W, K), C)+\beta \mathcal {S}(C) \end{aligned}$$
(5)

where K denotes the set of images selected by an expert in Fig. 4 (panel 1-a). In this paper, we use hard constraints: by moving one image toward or away from another, experts connect or disconnect them in the model. Such an expert constraint essentially sets a boundary on the pairwise image similarities. Once an expert connects the selected images, the system sets all \(W_{ij}\)'s (\(i, j \in K\), \(i \ne j\)) to 1. Likewise, all \(W_{ij}\)'s (\(i, j \in K\), \(i \ne j\)) are set to 0 if the expert thinks the images should be grouped differently. This rule updates the neighboring matrix W in Eq. (2). Once all \(W_{ij}\)'s specified by the expert are updated, the algorithm triggers further learning of the image representation C and the visual and verbal topics P and Q with respect to the objective function in Eq. (5).
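
This rule translates directly into code; the must-link/cannot-link update touches only the pairs within the selected set K:

```python
def apply_expert_constraint(W, K, connect):
    """Hard expert constraint on W: dragging the selected images K
    together sets every pairwise weight within K to 1; dragging them
    apart sets those weights to 0. Returns an updated copy of W."""
    W = W.copy()
    for i in K:
        for j in K:
            if i != j:
                W[i, j] = 1.0 if connect else 0.0
    return W

# e.g., the expert drags images 4, 9, and 21 together:
# W = apply_expert_constraint(W, [4, 9, 21], connect=True)
```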

5.2 Constraint on topic-coefficient matrix, C

Experts can also improve the image grouping through the task illustrated in Fig. 4 (panel 5). For each topic selected by experts in the listbox, its top terms in the topic-term distribution are listed; this list explains the gist of the topic to experts. The images that the algorithm considers highly relevant to the selected topic are then displayed at the bottom. The experts' task is to mark the image(s) least relevant to the topic, disconnecting their links to it. After experts have indicated the least relevant image(s), the system updates the coefficient matrix C according to the constraint in Eq. (6).

$$\begin{aligned}&\min _{P, Q, C \ge 0} \Vert E-{PC}\Vert _F^2+\Vert V-{QC}\Vert _F^2+\alpha \mathcal {G}(W, C)+\beta \mathcal {S}(C)\\&\quad \text {s.t. } C_{ij} = 0,\ i \in T,\ \text {and } j \in L(i) \end{aligned}$$
(6)

where T is the collection of selected topics and L(i) is the set of least relevant images for topic i. The element \(C_{ij}\) is set to 0 if image j is selected as least relevant to topic i. Once all such \(C_{ij}\)'s are updated, the algorithm continues learning P, Q, and C with respect to Eq. (6).
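
A sketch of this constraint follows. One simple way to keep \(C_{ij}=0\) throughout the subsequent learning (an implementation detail not spelled out in the text) is to re-impose the zeros after every optimizer update.

```python
def clamp_least_relevant(C, least_relevant):
    """Enforce the constraint of Eq. (6): least_relevant maps each
    selected topic i to the images L(i) the expert judged least relevant,
    and the corresponding entries of C are set (and kept) at zero."""
    C = C.copy()
    for topic, images in least_relevant.items():
        for img in images:
            C[topic, img] = 0.0
    return C

# e.g., the expert disconnects images 3 and 17 from topic 2:
# C = clamp_least_relevant(C, {2: [3, 17]})
```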

5.3 Constraint on topic-basis matrix, Q

Given the dermatology image grouping domain, where experts intuitively work visually, we prioritize interactions at the image level. However, our verbal data also suit visual text analytics approaches that update the model based on experts' term-level inputs [27].

$$\begin{aligned} \min _{P, Q, C \ge 0} \Vert E-PC\Vert _F^2+\Vert V-QC\Vert _F^2+\alpha \mathcal {G}(W, C)+\beta \mathcal {S}(C)+\gamma _1 \Vert Q-Q_r\Vert \end{aligned}$$
(7)

where \(Q_r\) is a matrix consisting of expert inputs for the verbal topics. It receives experts' changes to the topic-term distributions during interactions. Every time the matrix \(Q_r\) changes due to an expert interaction, the underlying verbal topics in Q are learned toward it.
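
Below is a sketch of one update step for Q under Eq. (7), assuming the penalty is the squared Frobenius norm \(\Vert Q-Q_r\Vert _F^2\) and using projected gradient descent as a hypothetical solver; the paper's actual update rule is not reproduced here.

```python
import numpy as np

def update_Q(Q, Q_r, V, C, gamma1, lr=1e-3):
    """One projected gradient step on Q for Eq. (7). The data-term
    gradient comes from ||V - QC||_F^2; the penalty gradient pulls Q
    toward the expert-edited topics Q_r. lr is a hypothetical step size."""
    grad = -2 * (V - Q @ C) @ C.T + 2 * gamma1 * (Q - Q_r)
    return np.maximum(Q - lr * grad, 0.0)  # project onto Q >= 0
```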

Table 1 Image grouping performances of fully automated learning and our paradigm

For updates of the neighboring matrix W, the topic-coefficient matrix C, and the topic-basis matrix Q, the model is learned incrementally and remains consistent between successive interactions. In order for experts to work on consistent image groupings, we also keep the visualization consistent between successive interactions. This is achieved by storing the 2D coordinates of the images and using them as the starting point in the graph view [Fig. 4 (panel 1)] for the next interaction [24].
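
With scikit-learn's t-SNE, this coordinate persistence can be expressed by passing the stored layout as the initialization (a sketch; the exact mechanism follows [24], and the perplexity is a hypothetical setting):

```python
from sklearn.manifold import TSNE

def embed_consistently(C, prev_coords):
    """Re-embed the topic coefficients C (k x m) into 2D, initializing
    t-SNE with the m x 2 layout stored from the previous round so the
    graph view stays visually consistent across expert interactions."""
    return TSNE(n_components=2, init=prev_coords,
                perplexity=15).fit_transform(C.T)
```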

6 Evaluation and discussions

To evaluate the effectiveness of the paradigm with respect to an expert's objectives, a domain expert (co-author) was asked to provide a reference image grouping that best matches her overall understanding of the relationships between the medical images in the database. In particular, for each image she listed its most similar images in terms of their differential diagnoses. We designed an experiment to compare the image grouping performance of fully automated machine learning with that of our expert-in-the-loop paradigm. For fully automated learning (case 1), the resulting image grouping was estimated by a model without expert inputs. In our paradigm (case 2), the physician interacted with the model in the loop toward a better image grouping; she manipulated the images based on her medical knowledge and the clinical information presented in them. To quantitatively evaluate the image grouping performance, we retrieved each image's neighbors and compared them to the corresponding reference image grouping for both cases.
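
A sketch of this evaluation, assuming neighbors are retrieved by similarity in the learned representation; `reference` maps each image to the expert's list of most similar images:

```python
import numpy as np

def neighbor_recall(S, reference, k=10):
    """Fraction of each image's reference neighbors found within its
    top-k retrieved neighbors (the measure reported in Tables 1-4).
    S is the m x m similarity matrix; reference maps image i to the
    set of images the expert listed as most similar to i."""
    scores = []
    for i, ref in reference.items():
        ranked = [j for j in np.argsort(-S[i]) if j != i]  # most similar first
        hits = set(ranked[:k]) & set(ref)
        scores.append(len(hits) / len(ref))
    return float(np.mean(scores))
```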

Table 1 summarizes the performance of both cases given various modalities. The image groupings with expert interactive constraints consistently outperform the traditional learning case. In particular, our paradigm performs much better than fully automated learning with the verbal feature of correct diagnosis (CD). This suggests that diagnoses are the primary factor the expert considers when grouping medical images. Furthermore, learning from multimodal features achieves the best performance in both cases. We also elicited the expert's qualitative evaluation through an interview; the expert noticed the improvement in each iteration. K-means, PCA, hierarchical clustering [28], LSA, and LDA are used for comparison purposes. Since these algorithms are not easily applied to multiple modalities, their multimodal performances are omitted. Density-based and distribution-based algorithms do not work because of the small number of data instances we have so far.

Similar to Table 1, Tables 2, 3, and 4 show the performance within the top 10, 15, and 20 retrieved neighbors. Various algorithms are good at modeling specific data modalities, which directs our future work toward modeling data with different statistical properties distinctly [29]. Since we use multimodal expert data for initialization, multiple graph-weighting strategies can also be adopted in the future to capture various data attributes [26]. Image features generally do not perform as well as verbal features; intuitively, this is because the verbal features contain terms that capture the domain knowledge, whereas the image features do not. We also observe that the eye-tracking filters do not always boost the performance of image features. This points us to future work on improved use of eye movement data, such as high-level behavioral patterns [30]. The experimental outcomes show that Laplacian sparse coding does not always or substantially beat some of the other learners. Therefore, our future work also includes adapting and improving other approaches for interactive machine learning.

Table 2 The percentage of images in the reference list to appear within the top 10 retrieved neighbors
Table 3 The percentage of images in the reference list to appear within the top 15 retrieved neighbors
Table 4 The percentage of images in the reference list to appear within the top 20 retrieved neighbors

During the paradigm evaluation, we also recorded the expert's verbal labeling of the image groups. This labeling helps disclose her diagnostic reasoning while grouping images and can be incorporated in future work to optimize the semantic feature space. Another important part of our future work involves implementing our paradigm on a larger dermatological image database with more experts in the loop to test our paradigm's robustness. New images with no eye-tracking trials and with no or very sparse annotations (a few words from the morphology categories or the disease names) will first be positioned in the model based purely on visual similarities. An image hierarchy can be learned and visualized, and for ease of expert interaction, a few representative images can be selected from each group. Even when new images have no offline annotations at all, they can still be positioned in an existing image grouping for further improvement, since single-modality features can easily be projected into the unified topic space [13].

The presentation of image groupings could also be based on experts' trade-offs between various factors, such as the primary lesion morphology and the causes of the diseases. Our current visualization may not be feasible for a larger database; it will be necessary to design a more effective visualization strategy that allows experts to explore both the global structure and local details of the image grouping. To receive experts' accurate inputs through interactions, a learning framework with feature selection can be adopted [31]. Furthermore, to minimize the offset between the neighborhood in the topic space and that in the visualization space, a joint regularization strategy can be developed [32]. It can also be envisioned that, with a larger dataset, a few image cases could constantly bubble up in the neighborhood and cause user fatigue as the user repeatedly skips them. Our future work therefore also includes developing a penalty term to isolate the images implicitly skipped by a user.

There are both global and local constraints in our paradigm. In general, a global constraint such as the neighboring matrix in Sect. 5.1 ensures that the learned hidden topics best retain the relationships between the observed data points, whereas the local constraints usually refer to experts' localized changes regarding a small subset of image relations. The balance is achieved through the interactive process, when the machine and an expert finally agree on each other's decisions. For the flexibility of the model and the generalization of the paradigm to a larger dataset, our future work involves replacing the hard constraints in Eq. (5) with soft ones. In this way, the parameters of the neighboring graph can also be learned and adapted to reflect the relative similarities among the neighbors locally. In order to balance the influence of the offline collected expert data against online expert inputs, other soft constraints could be applied by encoding expert interactions in a new penalty term. Besides, similar to updating the verbal topics in Eq. (7), we plan to allow updating the eye movement-filtered image patterns through a similar term \(\gamma _2 \Vert P-P_r\Vert \) and to support such updates with corresponding visualizations.

In our model, expert input is transformed into constraints, which are then used to update the model. Experts have the flexibility to provide all the constraints in one round or to spread them over multiple rounds. The order of these constraints does not affect the final model; in other words, the final model remains the same as long as the same set of constraints is provided (stability). However, the intermediate results of the model may affect an expert's decision making, which may lead her to provide different constraints. Such bidirectional effects have been observed in human-agent reciprocal social interaction studies [33]. These dynamics are interesting and can be studied in our future work.

In the realm of interactive machine learning, there is always a trade-off between the power of the model to capture the underlying semantics and the simplicity of the model needed for good responsiveness and real-time interaction. In a typical interaction loop with our current implementation, the expert spent about 1 min entering her constraints, and the learning algorithms (including the visualization algorithm) converged within 10 s on a single-core machine. In the future, we will consider approximate learning rules for better responsiveness and online learning algorithms to handle new data points [34]. Moreover, we may use a different learning framework for fast model updates than for model initialization [35].

7 Conclusions

This paper presents an interactive machine learning paradigm that keeps experts in the loop to improve image grouping. We demonstrate that image grouping can be significantly improved by expert constraints through incremental updates of the underlying computational model. In each iteration, our paradigm accommodates the model to the experts' input. The performance evaluation shows that expert constraints are an effective way to infuse expert knowledge into the learning process and improve the overall image grouping.