1 Introduction

In recent years, as pointed out in [1], there has been increasing interest in Web usage mining as a means to capture the behavioral patterns of Web users and to derive e-business intelligence. In [2–4], for example, automatic personalization was proposed based on clustering of user transactions and page views. A prevalent alternative approach for building personalized recommendation engines is collaborative filtering. Given a record of activity of a target user, the collaborative filtering approach compares that record with the historical records of other users so as to find the top users who have similar tastes or interests. However, collaborative filtering is known to have some deficiencies, see e.g. [5–8], and some optimization strategies have been proposed in [9–11] to overcome such shortcomings. More recently, a personal browsing assistant system was developed in [12], where the pre-fetched resources from hyperlinked Web pages are compared so as to recommend which Web page should be requested next. As an application, there are recommendation engines specialized for fashion, e.g. [13, 14]. To the best knowledge of the authors, however, color information has not been incorporated in the literature for developing better personalized recommendation engines.

The reason why the colors of products have been ignored in e-marketing is that a product typically involves many different colors. While a human observer may identify one dominating color of a product from the overall impression, it is difficult to mechanize the process of identifying the dominating color. Accordingly, in many applications, the color of a product is defined subjectively by those who enter the data. Furthermore, the terms used for describing a color are often vague and too numerous. Consequently, the color of a product has been a missing link in e-marketing. The purpose of this paper is to fill this gap by developing an algorithmic procedure for identifying the dominating color of a product by analyzing a digital image of the product. The algorithmic procedure enables one to reveal the color preference of a consumer by analyzing the digital images of the products purchased by the consumer. A recommendation engine is also developed based on the color class preference vectors of individual consumers, as shown in Fig. 30.1.

Fig. 30.1 Overview of the proposed personalized recommendation engine

Throughout the paper, vectors and matrices are indicated by a single underbar and a double underbar, respectively, e.g. \(\underline{\xi },\underline{\underline{P}}(t)\), etc.

2 Personalized Recommendation Engine Based on the Color of the Product

2.1 Development of Algorithm for Identifying the Dominating Color of a Product

A typical digital image of a product used in e-fashion business consists of a large number of pixels, too many to define the single dominating color of the product directly. In order to overcome this difficulty, we introduce \(\underline{\varPhi }(v_{p}) \in \mathcal{R}^{6}\), which we call the CCPV (Color-Class Profile Vector) of a digital image v containing a product p. Each pixel is originally given as a vector in RGB. To the human eye, however, the Euclidean distance in RGB does not necessarily reflect the way humans perceptually differentiate colors. For this reason, CIE (Commission Internationale de l’Éclairage), the international commission on illumination, proposed the space denoted by CIE-Lab in 1978. In CIE-Lab space, RED, GREEN, YELLOW, BLUE, WHITE and BLACK are the extremes of the axes and serve as representative colors [15–17]. By measuring the closeness of each pixel to these six representative colors, we convert each pixel into a form that facilitates clustering. More precisely, we transform the set of pixels constituting a digital image of a product, denoted by \(v_{p}\), into CIE-Lab and then into a set of six-dimensional vectors. By measuring the Euclidean distances between each of the transformed pixels and six fixed points in CIE-Lab representing RED, GREEN, YELLOW, BLUE, WHITE and BLACK, and then taking the average over the pixels in \(v_{p}\), the color of the product as perceived by the human eye is represented by the vector \(\underline{\varPhi }(v_{p})\). The vector \(\underline{\varPhi }(v_{p})\) is calculated through the following steps.

Step 1: Extraction of the pixels of the product image from the background

Every digital image obtained from the data has a background filled with a unique pixel value representing “NON-COLOR”. This pixel value is different from the one corresponding to “WHITE” and never appears within the digital image of a product. Accordingly, the set of pixels constituting exactly the digital image of the product p can be extracted. The resulting set of pixels is denoted by \(v_{p}\), and the number of pixels in \(v_{p}\) is written as \(N_{v_{p}}\).

Step 2: Transformation of RGB vectors into CIE-Lab vectors

In Step 2, each pixel in \(v_{p}\) is transformed from RGB into CIE-Lab. The transformation \(\mathcal{T}_{\mathrm{I}}\) of the pixel \(\underline{\gamma }= ^{\mathrm{t}}(\gamma _{\mathrm{R}},\gamma _{\mathrm{G}},\gamma _{\mathrm{B}}) \in\) RGB into \(\underline{\eta }= ^{\mathrm{t}}(\eta _{\mathrm{L}},\eta _{\mathrm{a}},\eta _{\mathrm{b}}) \in\) CIE-Lab is constructed in three stages. In the first stage, \(\underline{\gamma }\) is mapped into an intermediate vector \(\underline{X} = ^{\mathrm{t}}(X_{1},X_{2},X_{3})\) via the linear transformation defined by

$$\displaystyle{ \left (\begin{array}{c} X_{1} \\ X_{2} \\ X_{3}\end{array} \right ) = \left (\begin{array}{rcrcr} 0.4125&\ &0.3576&\ &0.1804\\ 0.2127 &\ &0.7151 &\ &0.0722 \\ 0.0193&\ &0.1192&\ &0.9502\end{array} \right )\times \left (\begin{array}{c} \gamma _{\mathrm{R}}\\ \gamma _{ \mathrm{G}}\\ \gamma _{\mathrm{B} }\end{array} \right ). }$$
(30.1)

The second stage constructs \(\underline{f} = ^{\mathrm{t}}(\,f_{1},f_{2},f_{3})\) from \(\underline{X} = ^{\mathrm{t}}(X_{1},X_{2},X_{3})\) through the following definition. For \(i = 1, 2, 3\), let \(f_{i}\) be defined by

$$\displaystyle{ f_{i} = \left \{\begin{array}{ll} X_{i}^{\frac{1} {3} } & \mathrm{if\ }X_{i}> 0.008856 \\ \frac{903.3X_{i} + 16} {116} &\mathrm{else}\end{array} \right.. }$$
(30.2)

Finally, \(\underline{f}\) is mapped into \(\underline{\eta }\) by,

$$\displaystyle{ \left (\begin{array}{c} \eta _{\mathrm{L}}\\ \eta _{\mathrm{a} } \\ \eta _{\mathrm{b}}\end{array} \right ) = \left (\begin{array}{rcrcr} 0&\ & 116&\ & 0\\ 500 &\ & - 500 &\ & 0 \\ 0&\ & 200&\ & - 200\end{array} \right )\times \left (\begin{array}{c} f_{1} \\ f_{2} \\ f_{3}\end{array} \right )+\left (\begin{array}{c} - 16\\ 0 \\ 0\end{array} \right ).\ \ \ }$$
(30.3)
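
The three stages of \(\mathcal{T}_{\mathrm{I}}\) can be sketched in a few lines of Python. This is only an illustrative sketch of Eqs. (30.1)–(30.3) as written above; the function name `rgb_to_lab` and the assumption that the RGB components are scaled to the interval [0, 1] are ours and are not stated in the text.

```python
def rgb_to_lab(rgb):
    """Map an RGB triple to (L, a, b) following Eqs. (30.1)-(30.3).

    Assumption: the RGB components are scaled to [0, 1].
    """
    # Stage 1, Eq. (30.1): linear map from RGB to the intermediate vector X.
    matrix = (
        (0.4125, 0.3576, 0.1804),
        (0.2127, 0.7151, 0.0722),
        (0.0193, 0.1192, 0.9502),
    )
    x = [sum(m * g for m, g in zip(row, rgb)) for row in matrix]

    # Stage 2, Eq. (30.2): cube root above the threshold, linear branch below.
    f = [
        xi ** (1.0 / 3.0) if xi > 0.008856 else (903.3 * xi + 16.0) / 116.0
        for xi in x
    ]

    # Stage 3, Eq. (30.3): affine map from f to (L, a, b).
    lightness = 116.0 * f[1] - 16.0
    a = 500.0 * (f[0] - f[1])
    b = 200.0 * (f[1] - f[2])
    return (lightness, a, b)
```

Under these assumptions, the black pixel (0, 0, 0) is mapped to the origin of CIE-Lab, and the pixel (1, 1, 1) receives a lightness of approximately 100.
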

Step 3: Construction of a CCPV

Given \(\underline{\eta }\in\) CIE-Lab, we consider another transformation \(\mathcal{T}_{\mathrm{II}}:\) CIE-Lab → CC \(\subset \mathcal{R}_{+}^{6}\), where \(\mathcal{R}_{+}^{6}\) is the set of nonnegative vectors in \(\mathcal{R}^{6}\). The space CC, standing for “Color Class”, is introduced so as to develop several different color classes as we will see. For constructing CC, the transformation \(\mathcal{T}_{\mathrm{II}}\) is defined by measuring the inverse of the squared Euclidean distances between \(\underline{\eta }\) and six fixed points in CIE-Lab representing RED, GREEN, YELLOW, BLUE, WHITE and BLACK. More formally, we consider the following six fixed points in CIE-Lab.

$$\displaystyle{ \begin{array}{lclrrrllclrrrl} \underline{\eta }_{\mathrm{R}}\! & =&\!^{\mathrm{t}}(\!& 50,&50,& 0&\!),&\quad \underline{\eta }_{\mathrm{G}}\! & =&\!^{\mathrm{t}}(\!&50,& - 50,& 0&\!), \\ \underline{\eta }_{\mathrm{Y}}\! & =&\!^{\mathrm{t}}(\!& 50,& 0,&50&\!),&\quad \underline{\eta }_{\mathrm{B}}\! & =&\!^{\mathrm{t}}(\!&50,& 0,& - 50&\!), \\ \underline{\eta }_{\mathrm{W}}\! & =&\!^{\mathrm{t}}(\!&100,& 0,& 0&\!),&\quad \underline{\eta }_{\mathrm{BK}}\! & =&\!^{\mathrm{t}}(\!& 0,& 0,& 0&\!),\end{array} }$$
(30.4)

where the corresponding points in RGB are given by

$$\displaystyle{ \begin{array}{lclrrrl lclrrrl} \mathcal{T}_{\mathrm{I}}^{-1}(\underline{\eta }_{\mathrm{R}})\! & =&\!^{\mathrm{t}}(\!&0.59,&0.06,&0.18&\!),&\quad \mathcal{T}_{\mathrm{I}}^{-1}(\underline{\eta }_{\mathrm{G}})\! & =&\!^{\mathrm{t}}(\!& - 0.04,&0.25,&0.16&\!), \\ \mathcal{T}_{\mathrm{I}}^{-1}(\underline{\eta }_{\mathrm{Y}})\! & =&\!^{\mathrm{t}}(\!&0.29,&0.17,&0.01&\!),&\quad \mathcal{T}_{\mathrm{I}}^{-1}(\underline{\eta }_{\mathrm{B}})\! & =&\!^{\mathrm{t}}(\!& 0.04,&0.19,&0.55&\!), \\ \mathcal{T}_{\mathrm{I}}^{-1}(\underline{\eta }_{\mathrm{W}})\!& =&\!^{\mathrm{t}}(\!&1.25,&0.95,&0.91&\!),&\quad \mathcal{T}_{\mathrm{I}}^{-1}(\underline{\eta }_{\mathrm{BK}})\!& =&\!^{\mathrm{t}}(\!& 0,& 0,& 0&\!).\end{array} }$$
(30.5)

Given \(\underline{\gamma }\in v_{p}\), let \(\underline{\eta }= \mathcal{T}_{\mathrm{I}}(\underline{\gamma }) \in\) CIE-Lab and define \(\underline{\phi }(\underline{\gamma }) = \mathcal{T}_{\mathrm{II}} \circ \mathcal{T}_{\mathrm{I}}(\underline{\gamma }) = \mathcal{T}_{\mathrm{II}}(\underline{\eta })\) by

$$\displaystyle{ \underline{\phi }(\underline{\gamma })\stackrel{\mbox{ def}}{=}c\left (\begin{array}{c} \vert \vert \underline{\eta }_{\mathrm{R}} -\underline{\eta }\vert \vert ^{-2} \\ \vert \vert \underline{\eta }_{\mathrm{G}} -\underline{\eta }\vert \vert ^{-2} \\ \vert \vert \underline{\eta }_{\mathrm{Y}} -\underline{\eta }\vert \vert ^{-2} \\ \vert \vert \underline{\eta }_{\mathrm{B}} -\underline{\eta }\vert \vert ^{-2} \\ \vert \vert \underline{\eta }_{\mathrm{W}} -\underline{\eta }\vert \vert ^{-2} \\ \vert \vert \underline{\eta }_{\mathrm{BK}} -\underline{\eta }\vert \vert ^{-2}\\ \end{array} \right )\,\ }$$
(30.6)

where \(\vert \vert \underline{x}\vert \vert\) denotes the Euclidean norm of \(\underline{x}\), and c is the normalization constant chosen so that the components of \(\underline{\phi }(\underline{\gamma })\) sum to one. It should be noted that \(\underline{\phi }(\underline{\gamma })\) is a probability vector, where each component describes how strongly a typical person would associate the pixel represented by \(\underline{\gamma }\) with the corresponding color among RED, GREEN, YELLOW, BLUE, WHITE and BLACK.
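
As an illustration of Eq. (30.6), the following Python sketch computes the normalized inverse squared distances from a CIE-Lab point to the six fixed points of Eq. (30.4). The names `ANCHORS` and `color_class_vector` are hypothetical, and the small floor `eps` on the squared distance is our assumption: the text does not say how a pixel that coincides exactly with a fixed point is handled.

```python
# Six fixed points of Eq. (30.4), in the order R, G, Y, B, W, BK.
ANCHORS = {
    "RED": (50.0, 50.0, 0.0),
    "GREEN": (50.0, -50.0, 0.0),
    "YELLOW": (50.0, 0.0, 50.0),
    "BLUE": (50.0, 0.0, -50.0),
    "WHITE": (100.0, 0.0, 0.0),
    "BLACK": (0.0, 0.0, 0.0),
}


def color_class_vector(lab, eps=1e-12):
    """Eq. (30.6): inverse squared distances to the anchors, scaled to sum to one.

    Assumption: squared distances are floored at `eps` to avoid division by zero.
    """
    weights = []
    for anchor in ANCHORS.values():
        squared_distance = sum((p - q) ** 2 for p, q in zip(anchor, lab))
        weights.append(1.0 / max(squared_distance, eps))
    total = sum(weights)  # 1 / total plays the role of the constant c
    return [w / total for w in weights]
```

For instance, the point (50, 0, 0) is equidistant from all six fixed points, so each component equals 1/6, whereas a point such as (50, 40, 0) puts most of its weight on RED.
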

A schematic diagram of the above steps is shown in Fig. 30.2.

Fig. 30.2 Algorithm of CCPV construction for a typical digital image

The color-class profile vector of \(v_{p}\) can now be defined by

$$\displaystyle{ \underline{\varPhi }(v_{p})\stackrel{\mbox{ def}}{=} \frac{1} {N_{v_{p}}}\sum _{\underline{\gamma }\in v_{p}}\underline{\phi }(\underline{\gamma }). }$$
(30.7)

We may say that \(\underline{\varPhi }(v_{p})\) describes how a typical person would sense the six different colors RED, GREEN, YELLOW, BLUE, WHITE and BLACK from the overall impression of the digital image \(v_{p}\) of product p.
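
Steps 1 and 3 can be combined into a short sketch of Eq. (30.7). The function `ccpv` below is hypothetical: it takes the per-pixel map \(\underline{\phi }\) as a callable argument `phi` and assumes that background pixels can be recognized by comparison with a single sentinel value `background`, in the spirit of the NON-COLOR pixel of Step 1.

```python
def ccpv(pixels, phi, background=None):
    """Eq. (30.7): average phi(gamma) over the product pixels of an image.

    pixels     -- iterable of pixel values (e.g. RGB triples)
    phi        -- callable mapping one pixel to its six-dimensional vector
    background -- sentinel value marking NON-COLOR pixels (assumption)
    """
    # Step 1: keep only the pixels that belong to the product.
    product = [g for g in pixels if g != background]
    if not product:
        raise ValueError("image contains no product pixels")

    # Step 3: component-wise mean of the per-pixel vectors.
    vectors = [phi(g) for g in product]
    n = len(vectors)
    return [sum(v[k] for v in vectors) / n for k in range(len(vectors[0]))]
```

Because each \(\underline{\phi }(\underline{\gamma })\) sums to one, the averaged vector returned here sums to one as well.
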

2.2 Development of Color-Classes via Clustering of CCPVs

The algorithmic procedure described in Sect. 30.2.1 enables one to represent each digital image \(v_{p}\) of product p by the corresponding CCPV, \(\underline{\varPhi }(v_{p})\). The data obtained from X Corporation contain 5665 such digital images, to each of which one of 425 colors was assigned by X Corporation. The purpose of this section is to develop a reasonable number of color classes by clustering these 425 colors, so that the effects of color in marketing can be analyzed efficiently. For this purpose, we represent each color defined by X Corporation by a CCPV in CIE-Lab. More specifically, let x be a color given by X Corporation and define

$$\displaystyle{ V (x) =\{ v_{p}:\mathrm{ the\ color\ }x\mathrm{\ is\ assigned\ to\ product\ }p\}. }$$
(30.8)

The number of elements in \(V (x)\) is denoted by \(N(x) = \vert V (x)\vert\). The color x is then represented by \(\underline{\varPhi }_{x} \in\) CIE-Lab where

$$\displaystyle{ \underline{\varPhi }_{x} = \frac{1} {N(x)}\sum _{v_{p}\in V (x)}\underline{\varPhi }(v_{p}). }$$
(30.9)
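
Equation (30.9) is a group-wise mean, which can be sketched as follows. The function name `color_profiles` and the input format, a list of (color label, CCPV) pairs with one pair per digital image, are illustrative assumptions.

```python
from collections import defaultdict


def color_profiles(labelled_ccpvs):
    """Eq. (30.9): mean CCPV for each color label x.

    labelled_ccpvs -- iterable of (label, ccpv) pairs, one per digital image
    """
    groups = defaultdict(list)
    for label, vector in labelled_ccpvs:
        groups[label].append(vector)  # groups[x] corresponds to V(x)
    return {
        label: [sum(column) / len(vectors) for column in zip(*vectors)]
        for label, vectors in groups.items()
    }
```
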

2.3 Color Class Preference Vectors of Customer

In order to define a color class preference vector of a consumer, we introduce the following sets.

  • \(CUST =\{ i: 1 \leq i \leq N_{c}\}\): the set of customers

  • \(S =\{ j: 1 \leq j \leq N_{s}\}\): the set of product categories

  • \(S(\,j\,)\): the number of products in the product category \(j \in S\)

  • \(q_{r}(\,j\,)\): the set of products associated with the rth product in the product category \(j \in S\), that is, products which are identical and share the same product ID but belong to different color classes

  • \(Q(\,j\,) =\{ q_{1}(\,j\,),\cdots \,,q_{S(\,j\,)}(\,j\,)\}\): the set of product groups in the product category \(j \in S\), where each group consists of identical products having different color classes

  • \(N_{\mathrm{CC}}\): the number of color classes

  • \(CC =\{ 1,\cdots \,,N_{\mathrm{CC}}\}\): the set of color classes

  • \(n(i,j,x)\): the number of products purchased by consumer \(i \in CUST\) which belong to \(Q(\,j\,)\) and have color class \(x \in CC\)

For \(l \in CC\), i.e. \(l = 1,\cdots \,,N_{\mathrm{CC}}\), let the color class distribution vector, \(\underline{\theta }(i,j)\), be defined by

$$\displaystyle{ \underline{\theta }(i,j) = [\theta (i,j,1),\cdots \,,\theta (i,j,N_{\mathrm{CC}})];\ \ \theta (i,j,l) = \frac{n(i,j,l)} {\sum _{k=1}^{N_{\mathrm{CC}}}n(i,j,k)}. }$$
(30.10)

The corresponding mean and standard deviation vectors, \(\underline{\mu }(j)\) and \(\underline{\sigma }(j)\), can be obtained as

$$\displaystyle{ \underline{\mu }(j) = \frac{1} {N_{c}}\sum _{i\in CUST}\underline{\theta }(i,j), }$$
(30.11)

$$\displaystyle{ \underline{\sigma }(\,j\,) = [\sigma (\,j,1),\cdots \,,\sigma (\,j,N_{\mathrm{CC}})];\ \ \sigma (\,j,l\,) = \sqrt{ \frac{1} {N_{c} - 1}\sum _{i\in CUST}\{\theta (i,j,l) -\mu (\,j,l\,)\}^{2}}. }$$
(30.12)

Then the color class preference vector of consumer i ∈ CUST for the product category j ∈ S can be defined in the following manner.

$$\displaystyle{ \underline{z}(i,j) = [z(i,j,1),\cdots \,,z(i,j,N_{\mathrm{CC}})];\ \ z(i,j,l) = \frac{\theta (i,j,l) -\mu (j,l)} {\sigma (j,l)} . }$$
(30.13)
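
Equations (30.10)–(30.13) amount to standardizing each customer's color class shares within one product category. The sketch below follows them for a fixed category j. The function name `preference_vectors` and the dictionary input are hypothetical, and two choices are our own assumptions rather than part of the text: customers with no purchase in the category are skipped (their \(\underline{\theta }(i,j)\) is undefined), and a component is set to zero when its standard deviation vanishes.

```python
from math import sqrt


def preference_vectors(counts):
    """Return z(i, j, .) for one category j, following Eqs. (30.10)-(30.13).

    counts -- dict mapping customer id i to the list [n(i, j, 1), ..., n(i, j, N_CC)]
    """
    # Eq. (30.10): color class distribution of each customer.
    theta = {}
    for customer, row in counts.items():
        total = sum(row)
        if total > 0:  # assumption: customers without purchases in j are skipped
            theta[customer] = [n / total for n in row]

    n_c = len(theta)
    if n_c < 2:
        raise ValueError("at least two customers with purchases are required")
    n_cc = len(next(iter(theta.values())))

    # Eqs. (30.11) and (30.12): mean and sample standard deviation per class.
    mu = [sum(t[l] for t in theta.values()) / n_c for l in range(n_cc)]
    sigma = [
        sqrt(sum((t[l] - mu[l]) ** 2 for t in theta.values()) / (n_c - 1))
        for l in range(n_cc)
    ]

    # Eq. (30.13): standardized preference; zero when sigma vanishes (assumption).
    return {
        customer: [
            (t[l] - mu[l]) / sigma[l] if sigma[l] > 0 else 0.0
            for l in range(n_cc)
        ]
        for customer, t in theta.items()
    }
```

With two customers whose purchases fall entirely into opposite color classes, each customer obtains a positive score for the class actually purchased and a negative score for the other.
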

Let \(CCQ\left (j, \check{j} \right )\) be the set of color classes which products in \(q_{\check{j} } (j) \in Q(j)\) possess. If consumer i is to purchase a product \(p \in q_{\check{j} } (j) \in Q(j)\), then the color \(\tilde{x}(i,j)\) to be recommended is determined by

$$\displaystyle{ \tilde{x}(i,j) =\arg \max _{x\in CCQ(j,\check{j} \,)}\{z(i,j,x)\}. }$$
(30.14)

In the approach discussed above, the color class preference vector of consumer i ∈ CUST is defined for each product category j ∈ S. As an alternative approach, a single color class preference vector of consumer i ∈ CUST may be employed for all the products in \(Q =\bigcup _{ j=1}^{N_{S}}Q(j)\). In this case, in place of Eq. (30.10), we define

$$\displaystyle{ \underline{\theta }(i) = [\theta (i,1),\cdots \,,\theta (i,N_{\mathrm{CC}})];\ \ \theta (i,l) = \frac{\sum _{j=1}^{N_{\mathrm{S}}}n(i,j,l)} {\sum _{j=1}^{N_{\mathrm{S}}}\sum _{k=1}^{N_{\mathrm{CC}}}n(i,j,k)}. }$$
(30.15)
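
The pooled distribution of Eq. (30.15) sums the purchase counts over all categories before normalizing, which differs from averaging the per-category distributions of Eq. (30.10). A minimal sketch for a single customer is given below; the function name `pooled_distribution` and the convention of returning `None` for a customer without purchases are our assumptions.

```python
def pooled_distribution(counts_by_category):
    """Eq. (30.15): color class shares of one customer pooled over all categories.

    counts_by_category -- list of rows, where row j holds [n(i, j, 1), ..., n(i, j, N_CC)]
    """
    totals = [sum(column) for column in zip(*counts_by_category)]
    grand_total = sum(totals)
    if grand_total == 0:
        return None  # assumption: undefined without any purchase history
    return [t / grand_total for t in totals]
```
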

Then the color class preference vector of i ∈ CUST for all the products in Q can be defined by

$$\displaystyle{ \underline{z}(i) = [z(i,1),\cdots \,,z(i,N_{\mathrm{CC}})];\ \ z(i,l) = \frac{\theta (i,l) -\mu (l)} {\sigma (l)}, }$$
(30.16)

where the mean and standard deviation vectors, \(\underline{\mu }\) and \(\underline{\sigma }\), are computed from \(\underline{\theta }(i)\) in the same manner as in Eqs. (30.11) and (30.12). Let \(CCQ(j,\check{j})\) be defined as before. If consumer i ∈ CUST is to purchase a product \(p \in q_{\check{j}}(j) \in Q(j)\), then the color \(\tilde{x}(i)\) to be recommended is determined by

$$\displaystyle{ \tilde{x}(i) =\arg \max _{x\in CCQ(j,\check{j})}\{z(i,x)\}. }$$
(30.17)

The latter approach may work better because of the larger data volume involved in constructing \(\underline{z}(i)\). If the color class preference vector is not defined because of a lack of purchase history for the customer, or if no color options are available, the engine recommends the default choice of color.
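
The selection rules of Eqs. (30.14) and (30.17), together with the fallback to a default color, can be sketched as follows. The function name `recommend_color`, the 0-based indexing of color classes and the handling of a missing preference vector via `None` are illustrative assumptions.

```python
def recommend_color(z_row, available, default=None):
    """Eqs. (30.14)/(30.17): choose the available color class with the largest z.

    z_row     -- preference vector of the customer, indexed by color class (0-based),
                 or None when no purchase history exists
    available -- collection of color classes offered for the product, i.e. CCQ
    default   -- color returned when no recommendation can be computed
    """
    if z_row is None or not available:
        return default
    return max(available, key=lambda x: z_row[x])
```

Note that the maximization runs only over the color classes in which the product is actually offered, so the customer's overall favorite class is ignored if the product does not come in that color.
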

3 Numerical Experiments Based on Real Data

3.1 Data Description

Sumita Research Laboratory at the University of Tsukuba has been working with a TV shopping company, hereafter called X Corporation, on developing a CRM (Customer Relationship Management) support engine based on real data. X Corporation operates a worldwide retail business, offering a variety of products ranging from apparel, jewelry and home electronics appliances to food. A typical digital image used in the e-business consists of \(400 \times 400 = 160,000\) pixels, which would be too many to define the dominating color-class of the image directly.

The data obtained from X Corporation consist of demographic information on those consumers who purchased at least one product during the period between September 1st, 2004 and August 31st, 2007, as well as their purchasing records and channels, product records and TV programs during the period. The number of consumers, \(N_{c}\), was 455,415, the number of product categories, \(N_{s}\), was 34, and the data consisted of about 2.3 million records. The average number of purchase occasions per customer and the average purchased quantity per customer were 3.70 and 5.33, respectively. The digital images collected from the data obtained from X Corporation amount to 6762, involving 1782 types of products spread over 34 small categories. The structure and the key components of these records are shown in Fig. 30.3. The database of X Corporation defines 430 colors for the products corresponding to the 6762 digital images. However, five of them carry no color information (e.g. NON-COLOR, CLEAR) and are eliminated. Consequently, the data used for our analysis contain 425 colors (corresponding to 5665 digital images) defined by X Corporation. In what follows, these 425 colors are categorized into a smaller number of newly defined color-classes by analyzing the 6762 digital images. The algorithmic procedure used to establish the color-classes can be applied to a digital image of any product with one of the 425 colors, identifying the dominating color-class of the product automatically. In turn, the algorithmic procedure enables one to analyze the consumers from the perspective of color preferences, thereby filling the missing link in e-marketing.

Fig. 30.3 Category structure

In order to cluster the 425 colors, each represented by \(\underline{\varPhi }_{x}\), we employ the group average method in hierarchical clustering [18, 19]. In this approach, the vectors are grouped together step by step, each time merging the two clusters with the smallest Euclidean distance, until the predetermined number of clusters is reached. In each grouping, the resulting cluster is represented by one vector, generated as the weighted center of the two clusters being merged. We terminated the grouping just before any of the six representative colors (RED, GREEN, YELLOW, BLUE, WHITE, BLACK) would be merged with another of the six representative colors.
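
The merging procedure described above can be sketched as follows. This is a simplified illustration rather than the implementation used in the experiments: it follows the prose description (merge the two nearest clusters and represent the result by their size-weighted center), it stops at a target number of clusters `n_clusters` instead of applying the stopping rule based on the six representative colors, and the function name `agglomerate` is hypothetical. Library implementations of group-average linkage may differ in how the distance between clusters is updated.

```python
from math import dist


def agglomerate(vectors, n_clusters):
    """Merge the two closest clusters until `n_clusters` remain.

    Each cluster is stored as (member indices, center), where the center of a
    merged cluster is the size-weighted mean of the two merged centers.
    """
    clusters = [([i], list(v)) for i, v in enumerate(vectors)]
    while len(clusters) > n_clusters:
        # Find the pair of clusters whose centers are closest.
        a, b = min(
            ((p, q) for p in range(len(clusters)) for q in range(p + 1, len(clusters))),
            key=lambda pq: dist(clusters[pq[0]][1], clusters[pq[1]][1]),
        )
        (members_a, center_a), (members_b, center_b) = clusters[a], clusters[b]
        w_a, w_b = len(members_a), len(members_b)
        center = [(w_a * x + w_b * y) / (w_a + w_b) for x, y in zip(center_a, center_b)]
        clusters[a] = (members_a + members_b, center)
        del clusters[b]  # b > a, so index a stays valid
    return [sorted(members) for members, _ in clusters]
```

The exhaustive pair search makes each merge quadratic in the number of clusters, which is unproblematic for a few hundred vectors such as the 425 colors considered here.
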

For each cluster generated by the above algorithm, a histogram of the 425 colors is constructed over the digital images involved in the cluster. Namely, if a cluster consists of \(\underline{\varPhi }_{x(1)},\cdots \,,\underline{\varPhi }_{x(T)}\), then the histogram is constructed over the products in \(\bigcup _{l=1}^{T}V (x(l))\). The grouping resulted in 14 color-classes (i.e. \(N_{\mathrm{CC}} = 14\)), named BLACK, BEIGE, WHITE, PINK, BROWN, GRAY, BLUE, NAVY, GREEN, PURPLE, RED, ORANGE, SAXE-BLUE and YELLOW.

3.2 Accuracy Test for Color Class Recommendation Engine

In this subsection, we examine the accuracy of the color class recommendation engine developed in Sect. 30.2.3. The data set obtained from X Corporation is randomly partitioned into ten subsets of equal size. Based on the cross validation approach, nine subsets are used to construct \(\underline{z}(i,j)\) in Eq. (30.13) and \(\underline{z}(i)\) in Eq. (30.16), while the remaining subset is used for testing accuracy. In order to provide a basis for comparison, the following random estimation accuracy is considered.

Random Estimation:

If consumer i is to buy a product \(p \in q_{\check{j}}(j)\) and a color class is chosen randomly, the probability of its correctness is given by \(\vert CCQ(j,\check{j})\vert ^{-1}\) where \(CCQ(j,\check{j})\) is the set of color classes which products in \(q_{\check{j}}(j)\) possess.
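
The comparison between the engine and the random estimation can be sketched as follows. The function name `evaluate`, the format of the test cases and the way the random accuracy is aggregated (as the mean of \(\vert CCQ(j,\check{j})\vert ^{-1}\) over the test purchases) are our assumptions; the text only specifies the probability for a single purchase.

```python
def evaluate(test_cases, recommend):
    """Compare a recommender with uniform random choice on held-out purchases.

    test_cases -- list of (customer, available_classes, purchased_class) triples
    recommend  -- callable (customer, available_classes) -> recommended class
    Returns (accuracy, random_accuracy, ratio).
    """
    hits = sum(
        1 for customer, available, purchased in test_cases
        if recommend(customer, available) == purchased
    )
    accuracy = hits / len(test_cases)
    # Expected accuracy of picking one of the available classes at random.
    random_accuracy = sum(1.0 / len(available) for _, available, _ in test_cases) / len(test_cases)
    return accuracy, random_accuracy, accuracy / random_accuracy
```

A ratio above one indicates that the recommender is correct more often than a uniformly random choice among the available color classes.
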

Table 30.1 Accuracy test of the recommendation engine based on \(\underline{z}(i,j)\) and \(\underline{z}(i)\) for customer i and category j (“ratio” denotes acc./rand.)

In Table 30.1, the accuracy results based on \(\underline{z}(i,j)\) in Eq. (30.13) and those based on \(\underline{z}(i)\) in Eq. (30.16) are exhibited. One can observe that the color class recommendation engine outperforms the random estimation consistently, the only exception being “51 Watch” in the rows for \(\underline{z}(i,j)\). However, even for this product, the color class recommendation engine based on \(\underline{z}(i)\) outperforms the random estimation by a factor of two. It can be seen that, when the volume of test data is high, the color class recommendation engine based on \(\underline{z}(i)\) outperforms the color class recommendation engine based on \(\underline{z}(i,j)\). This implies that the color preferences of consumers carry over across product categories for products which are purchased rather often at modest prices, as represented by Fashion Wear (10 through 19), Bag (20 through 26) and Fashion Gadget (50 through 56). For more expensive products which are likely to be purchased less frequently, however, the color preferences of consumers within the product category prevail over those derived from all products, as can be seen in Fashion Accessory (30 through 34) and Brand Accessory (40 through 44). This result is consistent with the observation that a consumer who has a preferred color may nevertheless buy a product in another color as an accent. In any case, one may select the recommendation engine based on either \(\underline{z}(i,j)\) or \(\underline{z}(i)\), whichever is more suitable for the genre of product.

4 Conclusion

One of the important factors ignored in past analyses in e-marketing is the color of products. This is because it is difficult to define the color of a product, which typically consists of many different colors. The purpose of this research is to fill this gap by developing an algorithmic procedure for identifying the dominating color of a product by analyzing a digital image of the product. Since humans tend to clearly distinguish RED from GREEN as well as YELLOW from BLUE, the Euclidean distance in CIE-Lab is more consistent with human color perception than the Euclidean distance in RGB. Accordingly, for analyzing the color preferences of consumers in e-marketing, CIE-Lab is more appropriate than RGB. Based on this idea, we proposed the CCPV (Color-Class Profile Vector), which represents the overall impression of a digital image containing a product. Since each product has its color recorded in the database, these vectors can be utilized to categorize many different colors, resulting in 14 color classes. This enables one to study the color preferences of consumers by segment. Furthermore, it provides a basis for constructing a recommendation engine based on the color classes for enhancing e-commerce. We also confirmed the effectiveness of the personalized recommendation engine with CCPV through numerical experiments based on real data. This study is still in its infancy. It would be necessary to combine the color analysis proposed in this paper with other approaches, such as automatic personalization and collaborative filtering, so as to empower the existing recommendation engines. This line of research is underway and will be reported elsewhere in due course.