Know your customer: computing k-most promising products for targeted marketing
DOI: 10.1007/s00778-016-0428-3
- Cite this article as:
- Islam, M.S. & Liu, C. The VLDB Journal (2016) 25: 545. doi:10.1007/s00778-016-0428-3
Abstract
The advancement of the World Wide Web has revolutionized the way manufacturers do business. Manufacturers can collect customer preferences for products and product features from their sales and other product-related Web sites to enter and sustain themselves in the global market. For example, manufacturers can make intelligent use of these customer preference data to decide which products should be selected for targeted marketing. However, the selected products must attract as many customers as possible to increase the possibility of selling more than their respective competitors. This paper addresses this kind of product selection problem. That is, given a database of existing products P from the competitors, a set of the company’s own products Q, a dataset C of customer preferences and a positive integer k, we want to find the k-most promising products (k-MPP) from Q with the maximum expected number of total customers for targeted marketing. We model the k-MPP query and propose an algorithmic framework for processing this query and its variants. Our framework utilizes a grid-based data partitioning scheme and parallel computing techniques to realize the k-MPP query. The effectiveness and efficiency of the framework are demonstrated by extensive experiments with real and synthetic datasets.
Keywords
Product selection, Dynamic skylines, Reverse skylines, Algorithms, Complexity analysis
1 Introduction
Competitive products are the alternative choices that potential customers can decide to buy over any available product. The widespread use of the World Wide Web for selling goods online allows manufacturers to collect customer preferences for product features, e.g., search queries of online users, and thereby make intelligent use of these preference data to identify the competitive products as well as the potential buyers for them. The study of competitive products is crucially important for manufacturers to sustain themselves in the global market and has attracted considerable attention from the community (see [1, 6, 9, 13, 14, 15, 20, 25, 27, 28, 30] for a survey). For example, a sales department can exploit this kind of study to find the customer groups who are most likely to buy its products and to design specialized promotions, advertisement campaigns, coupons or similar events to expedite the sales of its products. In general, promotion events are meant to increase the sales of the products and thereby increase the overall revenue. However, several products might be interesting for the same customer, and not all products contribute equally to attracting customers in the market. Therefore, manufacturers wish to identify only a subset of products that can attract the highest number of customers in the market so that the advertising and other costs are spread over a larger number of customers.
Consider another example from the job market area where a software firm wishes to hire some employees to fill up a few vacant positions. The firm can advertise these positions along with the required skills and employment benefits. However, the advertised positions need to compete with other job openings available in the market. Therefore, the firm wishes to attract as many candidates as possible so that the probabilities of getting more interviewees are increased and the firm can select some best employees by interviewing them. Here, we can treat the firm as the manufacturer, job openings as products and the candidates as the customers.
The above product selection problem is defined as follows: given a set of products P from the existing market, a set of own products Q called query products, and a dataset of customer preferences C where \(C\gg P\gg Q\), find k products from Q with the maximum expected number of total customers. We term this as finding the k-most promising products (k-MPP). The solution to this k-MPP product selection problem requires a product adoption model for the customers and a product selection strategy for the manufacturer. The product adoption model tells which products are competing with each other to attract a particular customer and how much a product contributes to attracting a customer. The product selection strategy, in turn, tells how to select the products with the maximum expected number of customers.
In this paper, we present a novel product adoption model and a product selection strategy based on dynamic skyline [2, 18] and reverse skyline [4, 12] queries. The dynamic skyline query is used to retrieve data (products) from the customer’s perspective, and the reverse skyline query is used to retrieve data (customers) from the manufacturer’s perspective. Given a database of products P and a customer preference c as query, the dynamic skyline query for c retrieves all products \(p_1\in P\) that are more preferable to c than other products \(p_2\in P\). That is, products \(p_1\) match the preference of c better than other products \(p_2\) in P. On the other hand, given a database of customer preferences C, products P and a query product \(p_1\), the reverse skyline query retrieves all customers \(c\in C\) that prefer \(p_1\) to any other product \(p_2\in P\). In other words, the reverse skyline of \(p_1\) consists of all customers c that include \(p_1\) in their dynamic skylines. Both the dynamic skyline and reverse skyline queries follow the around-by semantics, i.e., they compare the absolute differences between the product attributes’ values and the customer preferences in those attributes. These types of queries are studied in [29] and [1] for supporting market research queries. These works assume that a customer appearing in the product’s reverse skyline contributes 100 % to the sustainment of the product in the market, i.e., the influence of a product is defined by its reverse skyline size. However, the dynamic skyline of a customer may consist of products from the competitors in the market, not only the company’s own products. That is, a customer may have other products in her preference list and none of them is dominated by the manufacturer’s own product (and vice versa). Therefore, the manufacturers cannot be certain about the adoption of the product by a customer; rather, they may associate a probability with it. Existing market research queries [1, 29] disregard this uncertainty.
In our model, we associate a probability with the product adoption among customers based on dynamic dominance (i.e., around-by semantics) and skylines (i.e., preference-based queries). The models in [13] and [30] also associate a probability with the product adoption among customers. However, these models [13, 30] assume that a product satisfies the requirements of a customer if the product attributes’ values are less than or greater than the customer-specified preferences in those attributes. Therefore, these models fail to model the scenarios where a customer might not like to minimize or maximize certain quality metric of a product. For example, a customer may not like a too small or a too big screen for a laptop; rather, he/she may like a certain range for it. This kind of preferences can only be modeled appropriately in dynamic skyline and reverse skyline queries with around-by semantics, i.e., the absolute differences between product attributes’ values and the customer preferences in those attributes.
We also present a novel product selection strategy for finding the k-most promising products (k-MPP) with the maximum expected number of customers for the manufacturers based on our product adoption model. The basic computational units of a k-MPP query are dynamic and reverse skylines. Though there are a number of established studies on dynamic [18] and reverse [1, 4, 5, 19, 29] skylines, none of them is efficient for processing k-MPP queries. Most of these works either rely on index structures that are query dependent or are not optimized for multiple units. In this paper, a query-independent grid-based data partitioning scheme is developed to process k-MPP queries by designing parallel algorithms for them. Our index scheme selectively stores one or more data objects as pivots from each partition to filter data objects belonging to a partition as early as possible while processing the basic computational units of a k-MPP query. Further, an approach is developed for grouping multiple queries and processing them together to improve efficiency.
The main contributions of this paper are as follows:
- We develop a novel probabilistic product adoption model among customers based on dynamic and reverse skylines.
- We develop a novel product selection strategy for finding the k-most promising products (k-MPP for short) with the maximum expected number of customers.
- We present a parallel computing approach for processing the k-MPP product selection query and its variants by designing a simple yet efficient query-independent grid-based data partitioning scheme.
- We also demonstrate the effectiveness and efficiency of our approach by conducting extensive experiments with both real and synthetic datasets.
2 Background
Definition 1
(Dynamic dominance) A data object \(ob_1\) dynamically dominates another data object \(ob_2\) w.r.t. a third data object \(ob_3\), denoted by \(ob_1\prec _{ob_3}ob_2\), if \(ob_1\) is at least as close as \(ob_2\) to \(ob_3\)’s values in all dimensions and is strictly closer to \(ob_3\)’s value in at least one dimension. Mathematically, the relation \(ob_1\prec _{ob_3}ob_2\) holds iff (a) \(|ob_3^i-ob_1^i|\le |ob_3^i-ob_2^i|, \forall i\in [1,...,d]\) and (b) \(|ob_3^i-ob_1^i|< |ob_3^i-ob_2^i|, \exists i\in [1,...,d]\).
Example 1
Consider the dataset of products and customers given in Fig. 1. According to Definition 1, \(p_1\) dynamically dominates \(p_3\) w.r.t. \(c_1\), i.e., \(p_1\prec _{c_1} p_3\), as \(|c_1^1-p_1^1|=|2-6|=4\le |c_1^1-p_3^1|=|2-6|=4\) and \(|c_1^2-p_1^2|=|8-6|=2< |c_1^2-p_3^2|=|8-20|=12\), i.e., \(p_1\) is no farther from \(c_1\) than \(p_3\) in the first dimension and strictly closer in the second.
From Definition 1, it is easy to verify that dynamic dominance relies on around-by semantics, which is also exemplified in Example 1. Dynamic dominance plays a very important role in modeling the customer preferences for a product in dynamic skyline [18] and reverse skyline [4] queries and has been studied extensively to establish the customer–product relationship [1, 29].
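Definition 1 translates almost directly into code. The following is a minimal sketch, not the authors' implementation; the function name `dyn_dominates` is hypothetical, and the coordinates of \(c_1\), \(p_1\) and \(p_3\) are read off the arithmetic in Example 1.

```python
from typing import Sequence


def dyn_dominates(ob1: Sequence[float], ob2: Sequence[float],
                  ob3: Sequence[float]) -> bool:
    """Return True iff ob1 dynamically dominates ob2 w.r.t. ob3 (Definition 1)."""
    d1 = [abs(r - a) for r, a in zip(ob3, ob1)]  # per-dimension distance of ob1 to ob3
    d2 = [abs(r - b) for r, b in zip(ob3, ob2)]  # per-dimension distance of ob2 to ob3
    no_worse = all(x <= y for x, y in zip(d1, d2))        # condition (a)
    strictly_better = any(x < y for x, y in zip(d1, d2))  # condition (b)
    return no_worse and strictly_better


# Coordinates taken from the arithmetic of Example 1.
c1, p1, p3 = (2, 8), (6, 6), (6, 20)
print(dyn_dominates(p1, p3, c1))  # True
print(dyn_dominates(p3, p1, c1))  # False
```

Because only absolute differences are compared, an object on the "other side" of \(ob_3\) at the same distance is treated exactly like its mirror image, which is the around-by behavior described above.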
2.1 Dynamic skyline
Definition 2
(Dynamic skyline [18]) The dynamic skyline of a customer \(c\in C\), denoted by DSL(c), consists of all products \(p_1\in P\) that are not dynamically dominated by other products \(p_2\in P\) w.r.t. the customer c, i.e., \(p_2\not \prec _cp_1\).
Example 2
Consider the dataset given in Fig. 1. According to Definition 2, the dynamic skyline of customer \(c_5\) consists of products \(p_4, p_7\) and \(p_9\) as these products are not dynamically dominated by other products in P w.r.t. \(c_5\). Similarly, the dynamic skylines of \(c_3\) and \(c_4\) are {\(p_2, p_3, p_4\)} and {\(p_2, p_3, p_4, p_5\)}, respectively.
The dynamic skyline DSL(c) consists of all products \(p_1\in P\) that dynamically match the customer preference c better than any other product \(p_2\in P\). Therefore, the dynamic skyline query is used to retrieve products from the customer’s perspective or point of view. We say that the products \(p\in DSL(c)\) compete with each other for the customer c to enter the market, e.g., products \(p_4, p_7\) and \(p_9\) compete with each other for \(c_5\).
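To make Definition 2 concrete, here is a naive quadratic sketch that tests every product against every other one. It is only an illustration of the definition, not an efficient algorithm; the names `dominates` and `dynamic_skyline` and the toy coordinates are assumptions and are unrelated to Fig. 1.

```python
from typing import Dict, Sequence, Set

Point = Sequence[float]


def dominates(a: Point, b: Point, ref: Point) -> bool:
    """Dynamic dominance of a over b w.r.t. ref (Definition 1)."""
    da = [abs(r - x) for r, x in zip(ref, a)]
    db = [abs(r - x) for r, x in zip(ref, b)]
    return all(x <= y for x, y in zip(da, db)) and any(x < y for x, y in zip(da, db))


def dynamic_skyline(c: Point, products: Dict[str, Point]) -> Set[str]:
    """Ids of the products that no other product dominates w.r.t. customer c."""
    return {
        pid for pid, p in products.items()
        if not any(dominates(o, p, c) for oid, o in products.items() if oid != pid)
    }


# Hypothetical toy data: one customer preference and four products.
customer = (5, 5)
products = {"a": (4, 6), "b": (8, 5), "d": (5, 2), "e": (9, 9)}
print(sorted(dynamic_skyline(customer, products)))  # ['a', 'b', 'd']
```

Product `e` is dropped because `a` is closer to the customer in both dimensions, while `a`, `b` and `d` are pairwise incomparable and therefore all remain in the skyline.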
2.1.1 Dynamic skyline computation
2.2 Reverse skyline
Definition 3
(Reverse skyline [4]) The reverse skyline of a product \(p_1\), denoted by \(RSL(p_1)\), consists of all customers \(c\in C\) such that \(p_1\) is not dynamically dominated by other products \(p_2\in P\) w.r.t. the customer c, i.e., \(p_2\not \prec _cp_1\). In other words, the reverse skyline of a product \(p_1\) retrieves all customers \(c\in C\) such that \(p_1\) appears in the dynamic skylines of c.
Example 3
Consider the dataset of products and customers given in Fig. 1. The reverse skyline of product \(p_4\) consists of customers \(c_3, c_4\) and \(c_5\) as each of them includes the product \(p_4\) in their dynamic skylines. Similarly, the reverse skylines of \(p_3\) and \(p_5\) are \(RSL(p_3)=\{c_3, c_4\}\) and \(RSL(p_5)=\{c_4, c_9\}\), respectively.
We say that all customers appearing in RSL(p) prefer the product p to others. Therefore, these customers are considered to be the potential buyers for the product p in the market [1], e.g., \(c_3, c_4\) and \(c_5\) are the potential buyers for the product \(p_4\) (see Example 3). A larger reverse skyline indicates more potential customers for a product.
Definition 4
(Influence set of a product [1, 29]) The influence set of a product \(p\in P\), denoted by IS(p), consists of the customers \(c\in C\) that appear in the reverse skyline of p, i.e., RSL(p). The influence score of p is measured by the size of the set, i.e., |IS(p)|.
Example 4
Consider the dataset of products and customers given in Fig. 1. The influence set of product \(p_4\) consists of customers \(c_3, c_4\) and \(c_5\), i.e., \(IS(p_4)=\{c_3, c_4, c_5\}\) and the influence score of \(p_4\) is \(|IS(p_4)|=3\). Similarly, the influence scores of \(p_3\) and \(p_5\) are \(|IS(p_3)|=|\{c_3, c_4\}|=2\) and \(|IS(p_5)|=|\{c_4, c_9\}|=2\), respectively.
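Definitions 3 and 4 can be checked without materializing any dynamic skyline: a customer c belongs to RSL(p) exactly when no other product dominates p w.r.t. c. The sketch below illustrates this membership test; the helper names and the toy data are hypothetical and do not come from Fig. 1.

```python
from typing import Dict, Sequence, Set

Point = Sequence[float]


def dominates(a: Point, b: Point, ref: Point) -> bool:
    """Dynamic dominance of a over b w.r.t. ref (Definition 1)."""
    da = [abs(r - x) for r, x in zip(ref, a)]
    db = [abs(r - x) for r, x in zip(ref, b)]
    return all(x <= y for x, y in zip(da, db)) and any(x < y for x, y in zip(da, db))


def reverse_skyline(pid: str, products: Dict[str, Point],
                    customers: Dict[str, Point]) -> Set[str]:
    """Customers for which products[pid] is not dominated by any other product."""
    target = products[pid]
    return {
        cid for cid, c in customers.items()
        if not any(dominates(o, target, c) for oid, o in products.items() if oid != pid)
    }


def influence_score(pid: str, products: Dict[str, Point],
                    customers: Dict[str, Point]) -> int:
    """|IS(p)|: the size of the reverse skyline (Definition 4)."""
    return len(reverse_skyline(pid, products, customers))


# Hypothetical toy data.
products = {"a": (2, 2), "b": (8, 8)}
customers = {"u": (1, 1), "v": (9, 9), "w": (5, 5)}
print(sorted(reverse_skyline("a", products, customers)))  # ['u', 'w']
print(influence_score("b", products, customers))          # 2
```

Customer `w` sits at equal distance from both products, so neither dominates the other and `w` appears in both reverse skylines.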
2.2.1 Reverse skyline (influence set) computation
The list of symbols
Symbol | Meaning |
---|---|
P | A set of products in the market |
C | A set of customers |
Q | A set of the company’s own products |
D | \(P\cup C\) |
d | The number of dimensions in D |
p | A product in P |
c | A customer in C |
q | A product in Q |
ob | An object in D |
DSL(c) | Dynamic skyline of c |
RSL(p) | Reverse skyline of p |
Pr(c, p|P) | Probability that c buys p |
E(C, p|P) | Market contribution of p given P |
\(E(C, P^\prime |P)\) | Market contribution of the set \(P^\prime \) given P |
|IS(p)| | Influence score of p |
\(|IS(P^\prime )|\) | Influence score of \(P^\prime \) |
k-MPP | The k-most promising products |
k-MPP\(_{ind}\) | Independent k-MPP |
k-MPP\(_{dep}\) | Dependent k-MPP |
m | Number of threads (i.e., worker nodes) |
n | Grid size |
\(D_l\) | The \(l\mathrm{th}\) partition of D |
\(pos_l\) | The positional vector of \(D_l\) |
\(\delta _i\) | Side length of a partition in \(i\mathrm{th}\) dimension |
loc(ob) | Partition of D that contains ob |
\(\mathcal {N}_c(D)\) | Search space of DSL(c) |
\(\mathcal {N}_q(D)\) | Search space of RSL(q) |
\({\mathcal {N}^+}_q(D)\) | Extended search space of RSL(q) |
\(\mathcal {M}\) | The midpoint skyline |
Table 1 shows the list of symbols used in the paper.
3 The proposed product adoption model and selection strategy
This section presents our product adoption model and the product selection strategy.
3.1 The proposed product adoption model
We propose a uniform product adoption (UPA) model based on dynamic and reverse skylines. In our model, we assume that every product \(p\in P\) appearing in the dynamic skyline of the customer \(c\in C\), i.e., DSL(c), competes with each other to attract the customer c. Also, the customers \(c\in C\) that appear in the reverse skyline of \(p\in P\), i.e., RSL(p), are the potential buyers for the product p. The UPA model is described below.
Definition 5
Example 5
Consider the dataset given in Fig. 1. The probability of \(p_4\) being chosen by the customer \(c_5\) is \(Pr(c_5, p_4|P)=\frac{1}{|DSL(c_5)|}=\frac{1}{3}\) as \(|DSL(c_5)|=3\) (from Example 2). Similarly, the probabilities of \(p_4\) being chosen by customers \(c_3\) and \(c_4\) are \(\frac{1}{3}\) and \(\frac{1}{4}\), respectively.
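Example 5 suggests that, under the UPA model, a customer spreads her choice uniformly over the products in her dynamic skyline. The sketch below encodes that reading using the dynamic skylines listed in Example 2. It assumes, as an interpretation, that a product outside DSL(c) receives probability 0; the function name `adoption_probability` is hypothetical.

```python
from fractions import Fraction
from typing import Dict, Set

# Dynamic skylines as stated in Example 2.
DSL: Dict[str, Set[str]] = {
    "c3": {"p2", "p3", "p4"},
    "c4": {"p2", "p3", "p4", "p5"},
    "c5": {"p4", "p7", "p9"},
}


def adoption_probability(c: str, p: str, dsl: Dict[str, Set[str]] = DSL) -> Fraction:
    """Pr(c, p | P): 1/|DSL(c)| if p is in DSL(c), otherwise 0 (assumed)."""
    skyline = dsl[c]
    return Fraction(1, len(skyline)) if p in skyline else Fraction(0)


print(adoption_probability("c5", "p4"))  # 1/3
print(adoption_probability("c4", "p4"))  # 1/4
```

Exact fractions are used so that the results can be compared directly with the values in the examples.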
3.1.1 Market contribution
The market contribution of a product \(p\in P\) is measured by the expected number of total customers in C that might choose to buy the product p over other products in P. We assume that a customer would be equally interested in each product that appears in her dynamic skyline. Thus, the customer assigns an equal weight to each such product, and this weight is inversely proportional to the size of her dynamic skyline as described in Sect. 3.1. Finally, the contribution of a product is the sum of the weights it receives from all the customers in the market.
Definition 6
Example 6
Consider the dataset given in Fig. 1. The market contribution of \(p_4\) is \(E(C, p_4|P)=Pr(c_3, p_4|P)+Pr(c_4, p_4|P)+Pr(c_5, p_4|P)=\frac{1}{3}+\frac{1}{4}+\frac{1}{3}=\frac{11}{12}\). Similarly, the market contribution of \(p_3\) and \(p_5\) is \(E(C, p_3|P)=Pr(c_3, p_3|P)+Pr(c_4, p_3|P)=\frac{1}{3}+\frac{1}{4}=\frac{7}{12}\) and \(E(C, p_5|P)=Pr(c_4, p_5|P)+Pr(c_9, p_5|P)=\frac{1}{4}+\frac{1}{2}=\frac{3}{4}\), respectively.
Definition 7
Example 7
Consider the dataset of products and customers given in Fig. 1. Assume that \(P^\prime \) is a product set consisting of \(p_3, p_4\) and \(p_5\). The market contribution of the product set \(P^\prime \) is \(E(C, P^\prime |P)=E(C, p_3|P)+E(C, p_4|P)+E(C, p_5|P)=\frac{7}{12}+\frac{11}{12}+\frac{3}{4}=2.25\).
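The arithmetic of Examples 6 and 7 can be replayed from the dynamic skylines alone. The following is a sketch under stated assumptions: only the four customers whose skylines contain \(p_3\), \(p_4\) or \(p_5\) are listed (consistent with the reverse skylines in Example 3), and `p_other` is a hypothetical placeholder for the second, unnamed product in \(DSL(c_9)\), whose size 2 is implied by \(Pr(c_9, p_5|P)=\frac{1}{2}\).

```python
from fractions import Fraction
from typing import Dict, Iterable, Set

DSL: Dict[str, Set[str]] = {
    "c3": {"p2", "p3", "p4"},
    "c4": {"p2", "p3", "p4", "p5"},
    "c5": {"p4", "p7", "p9"},
    "c9": {"p5", "p_other"},  # placeholder member, see the note above
}


def market_contribution(p: str, dsl: Dict[str, Set[str]]) -> Fraction:
    """E(C, p | P): sum of 1/|DSL(c)| over the customers whose skyline contains p."""
    return sum((Fraction(1, len(s)) for s in dsl.values() if p in s), Fraction(0))


def set_contribution(ps: Iterable[str], dsl: Dict[str, Set[str]]) -> Fraction:
    """E(C, P' | P): sum of the individual contributions, as in Example 7."""
    return sum((market_contribution(p, dsl) for p in ps), Fraction(0))


print(market_contribution("p4", DSL))             # 11/12
print(market_contribution("p5", DSL))             # 3/4
print(set_contribution(["p3", "p4", "p5"], DSL))  # 9/4
```

Each customer distributes a total weight of exactly 1 over her skyline, so summing the contributions over any product set can never exceed the number of customers, which matches the bound stated in Theorem 1.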
Theorem 1
The market contribution of a product set \(P^\prime \subseteq P\) is bounded as follows: \(0\le E(C, P^\prime |P)\le |C|.\)
Proof
3.1.2 Market contribution versus influence score
Definition 8
(Influence score of a product set [1]) The influence set of a product set \(P^\prime \), denoted by \(IS(P^\prime )\), consists of all customers \(c\in C\) that appear in the reverse skyline of \(p\in P^\prime \), i.e., \(IS(P^\prime )=\underset{\forall p\in P^\prime }{\cup }IS(p)\). The influence score of \(P^\prime \) is measured by the size of the set \(IS(P^\prime )\) and is denoted by \(|IS(P^\prime )|\).
Example 8
Consider the dataset given in Fig. 1. The influence set of the product set \(P^\prime =\{p_3, p_4, p_5\}\) consists of customers \(c_3, c_4, c_5\) and \(c_9\), i.e., \(IS(P^\prime )=\{c_3, c_4, c_5, c_9\}\). Therefore, the influence score of \(P^\prime \) is \(|IS(P^\prime )|=4\).
There is a clear distinction between the market contribution proposed in this paper and the influence score [1, 29]. We argue that the market contribution metric is more realistic than the influence score for measuring the product sustainment in the market. For example, if we use the influence score metric to judge the product sustainment in the market, then products \(p_3\) and \(p_5\) are indistinguishable, as both have an influence score of 2 (see Example 4). However, if we consider the market contribution, then \(p_5\) is preferable to \(p_3\) for the manufacturer, as the expected number of customers of \(p_5\) (\(\frac{3}{4}\)) is greater than that of \(p_3\) (\(\frac{7}{12}\); see Example 6). The influence score metric [1] assumes that the number of actual customers of a product set is equivalent to its influence set size, which is an overestimate of the expected number of customers in the market. The market contribution metric measures the expected number of customers probabilistically by taking into account all plausible choices of a customer, i.e., the market contribution combines both the customers’ perspective and the manufacturers’ perspective into the metric. The existing influence score [1] considers the manufacturers’ perspective only.
3.2 The proposed product selection strategy
We propose a novel product selection strategy for the manufacturers based on the UPA model developed in Sect. 3.1. The new query is called k-most promising products (k-MPP) selection query as given below.
Definition 9
(Generalized k-MPP query) Given a set of existing products P, a set of own products Q, a set of customers C and a positive integer k less than |Q|, the k-MPP query, denoted by k-\({\textit{MPP}}(Q, P, C)\), selects a subset \(Q^\prime \) of k products from Q whose market contribution is greater than that of any other subset \(Q^{\prime \prime }\) of k products from Q.
3.2.1 Independent k-MPP
We define a k-MPP query as independent k-MPP, denoted by k-\(MPP _{ind}\), where the query products in Q do not compete with each other. That is, products in Q are considered independently while computing their contribution in the market. For example, the competitors of \(q_1\) w.r.t. customer \(c_2\) are shown in Fig. 4a, where \(q_1\) does not compete with any other product from Q, but only with the competitors’ products, i.e., P. This kind of k-MPP query is suitable in scenarios where a manufacturer considers offering variants of the same type of product to the customers and would like to expedite the sales via market segmentation; or where a job applicant wishes to be interviewed by each sub-team in a company if she satisfies the requirements of each of them.
Definition 10
Example 9
Consider the dataset of existing products and customers given in Fig. 1. Assume that we are given four query products \(Q:\{q_1(12, 12), q_2(7, 15), q_3(14,11), q_4(19,11)\}\) and the manufacturer wishes to retrieve the top-3 products from Q with the maximum number of total customers. The market contribution of \(q_1\) is \(E(C, q_1|P)=Pr(c_2, q_1|P)+Pr(c_5, q_1|P) =\frac{1}{2}+\frac{1}{3}=\frac{5}{6}\). Similarly, we get \(E(C, q_2|P)=Pr(c_3, q_2|P)+Pr(c_4, q_2|P)=\frac{1}{2}+\frac{1}{5}=\frac{7}{10}, E(C, q_3|P)=Pr(c_2, q_3|P)+Pr(c_5, q_3|P)=\frac{1}{3}+\frac{1}{3}=\frac{2}{3}\) and \(E(C, q_4|P)=Pr(c_2, q_4|P)+Pr(c_5, q_4|P)+Pr(c_8, q_4|P)+Pr(c_9, q_4|P)+Pr(c_{10}, q_4|P)=\frac{1}{3}+\frac{1}{3}+\frac{1}{2}+\frac{1}{3}+\frac{1}{3}=\frac{11}{6}\). In k-\(MPP _{ind}\), we choose \(q_1, q_2\) and \(q_4\) as \(E(C, \{q_1, q_2, q_4\}|P)=\frac{5}{6}+\frac{7}{10}+\frac{11}{6}\approx 3.37\) is the highest among all 3-product subsets of Q.
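In Example 9, the contribution of a product set is the sum of the contributions of its members, so in the independent variant the best k-subset can be obtained by taking the k products with the largest individual contributions. The sketch below illustrates this with the values computed in Example 9; the function name `k_mpp_independent` is hypothetical.

```python
import heapq
from fractions import Fraction
from typing import Dict, Set, Tuple

# Individual market contributions E(C, q | P) from Example 9.
contribution: Dict[str, Fraction] = {
    "q1": Fraction(5, 6),
    "q2": Fraction(7, 10),
    "q3": Fraction(2, 3),
    "q4": Fraction(11, 6),
}


def k_mpp_independent(contrib: Dict[str, Fraction], k: int) -> Tuple[Set[str], Fraction]:
    """Pick the k products with the largest individual contributions."""
    chosen = heapq.nlargest(k, contrib, key=contrib.get)
    return set(chosen), sum((contrib[q] for q in chosen), Fraction(0))


chosen, total = k_mpp_independent(contribution, 3)
print(sorted(chosen))          # ['q1', 'q2', 'q4']
print(total, float(total))     # 101/30, roughly 3.37
```

The exact total is \(\frac{101}{30}\), which rounds to the 3.37 reported in the example.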
3.2.2 Dependent k-MPP
We define a k-MPP query as dependent k-MPP, denoted by k-\(MPP _{dep}\), where we allow the query products in Q to compete with each other. That is, we compute the market contribution of q considering the products not only from P, but also from Q as q’s competitors. For example, the competitors of \(q_1\) w.r.t. customer \(c_2\) are shown in Fig. 4b, where \(q_1\) competes with the products from P as well as the products from the query product set Q. This kind of k-MPP query is useful for scenarios where a manufacturer wishes to judge the sustainment of a new product which is a variant of an existing product; or where a job applicant wishes to be interviewed by only one sub-team in a company.
Definition 11
Example 10
Consider the dataset of existing products and customers given in Fig. 1. Assume that we are given four query products \(Q:\{q_1(12, 12), q_2(7, 15), q_3(14,11), q_4(19,11)\}\) and the manufacturer wishes to retrieve the top-3 products from Q with the maximum number of total customers. The market contribution of \(q_1\) is \(E(C, q_1|P\cup Q)=Pr(c_2, q_1|P\cup Q)+Pr(c_5, q_1|P\cup Q)=\frac{1}{4}+\frac{1}{3}=\frac{7}{12}\). Similarly, we get \(E(C, q_2|P\cup Q)=Pr(c_3, q_2|P\cup Q)+Pr(c_4, q_2|P\cup Q)=\frac{1}{2}+\frac{1}{5}=\frac{7}{10}, E(C, q_3|P\cup Q)=Pr(c_2, q_3|P\cup Q)+Pr(c_5, q_3|P\cup Q)=\frac{1}{4}+\frac{1}{3}=\frac{7}{12}\) and \(E(C, q_4|P\cup Q)=Pr(c_8, q_4|P\cup Q) +Pr(c_9, q_4|P\cup Q) +Pr(c_{10}, q_4|P\cup Q)=\frac{1}{2}+\frac{1}{3}+\frac{1}{3}=\frac{7}{6}\). In k-\(MPP _{dep}\), we choose \(q_2, q_3\) and \(q_4\) as \(E(C, \{q_2, q_3, q_4\}|P\cup Q)=\frac{7}{10}+\frac{7}{12}+\frac{7}{6}=2.45\) is the highest among all 3-product subsets of Q.
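The difference between the two variants lies only in which products are allowed to compete with a query product. The sketch below contrasts them on hypothetical toy data (not Fig. 1). It assumes, following the description in Sect. 3.2.2, that in the dependent case the dynamic skyline of a customer is computed over \(P\cup Q\), whereas in the independent case it is computed over P plus the single query product under consideration; all names are illustrative.

```python
from fractions import Fraction
from typing import Dict, Sequence

Point = Sequence[float]


def dominates(a: Point, b: Point, ref: Point) -> bool:
    """Dynamic dominance of a over b w.r.t. ref (Definition 1)."""
    da = [abs(r - x) for r, x in zip(ref, a)]
    db = [abs(r - x) for r, x in zip(ref, b)]
    return all(x <= y for x, y in zip(da, db)) and any(x < y for x, y in zip(da, db))


def probability(qid: str, c: Point, pool: Dict[str, Point]) -> Fraction:
    """Uniform adoption probability of pool[qid] when all pool members compete for c."""
    skyline = {
        i for i, x in pool.items()
        if not any(dominates(o, x, c) for j, o in pool.items() if j != i)
    }
    return Fraction(1, len(skyline)) if qid in skyline else Fraction(0)


# Hypothetical toy data: one competitor product, two query products, one customer.
P = {"a": (4, 4)}
Q = {"q1": (1, 2), "q2": (2, 1)}
c = (0, 0)

independent = {q: probability(q, c, {**P, q: Q[q]}) for q in Q}  # competitors: P only
dependent = {q: probability(q, c, {**P, **Q}) for q in Q}        # competitors: P and Q

print(independent)  # each query product gets probability 1
print(dependent)    # the two query products share the customer, 1/2 each
```

In this toy case both query products dominate the competitor, so each wins the customer outright when considered alone, but they split the customer once they compete with each other. This mirrors how \(Pr(c_2, q_1|\cdot)\) drops from \(\frac{1}{2}\) in Example 9 to \(\frac{1}{4}\) in Example 10.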
3.3 Related product selection models
A number of related product selection models exist in the literature [1, 13, 29, 30]. Given a dataset of products and customers, Wu et al. [29] propose an algorithm called BRS to compute the influence set (IS) of a given query product q, which outperforms the branch and bound algorithm proposed by Dellis et al. [4]. Later, Arvanitis et al. [1] extend this to select a subset of query products \(Q^\prime \) from a given product set Q that jointly maximizes the size of the influence set, i.e., \(|IS(Q^\prime )|\). The products in this set are termed the most attractive candidates (MAC), and if the size of \(Q^\prime \) is k, then this is called the k-MAC query.
Theorem 2
Assume that \(Q^\prime \) is the set of k products selected by the k-MAC query and \(Q^{\prime \prime }\) is the set of k products selected by the k-MPP query optimally from Q. We get the following: \(|IS(Q^\prime )|\ge |IS(Q^{\prime \prime })|\) but \(E(C, Q^\prime |P\cup Q)\le E(C, Q^{\prime \prime }|P\cup Q)\) (also \(E(C, Q^\prime |P)\le E(C, Q^{\prime \prime }|P)\)).
Proof
- (i) \(E(C, q_1|P\cup Q) =\frac{|C_1|}{2}+|C_2|\);
- (ii) \(E(C, q_2|P\cup Q)=\frac{|C_1|}{2}\);
- (iii) \(E(C, q_3|P\cup Q)=|C_3|\);
- (iv) \(IS(\{q_1, q_3\}) =C_1\cup C_2\cup C_3\) and
- (v) \(IS(\{q_1, q_2\}) =C_1\cup C_2\).
Given a set of products and customers, Lin et al. [13] propose a product selection model called discovering k-most demanding products, denoted by k-MDP, where a product p is assumed to satisfy a customer c iff \(p^i\le c^i\). This work also maximizes the number of expected customers for a subset of products of a given set of candidate products Q as we do in our model. Xu et al. [30] propose two types of product selection models, called (a) k-best selling products (k-BSP) and (b) k-best-benefit products (k-BBP), that increase the expected sales in the market. This work also develops a probabilistic product adoption model by assuming that a product satisfies a customer iff \(p^i\ge c^i\). However, the product adoption models of these works [13, 30] are not realistic in the sense that customers do not always wish to minimize or maximize the attributes of a product; rather, they may prefer certain ranges in them. Attributes that customers do wish to minimize or maximize can still be modeled in our approach by setting the customer preferences in those attributes to the corresponding minimum or maximum values. Therefore, we argue that our product adoption model is more robust compared to [13] and [30] as we model product adoption among the customers through around-by semantics (i.e., dynamic dominance and skyline) and more sustainable compared to [1, 29] as we model product selection by considering both the customer and the manufacturer perspectives (recall from Sect. 3.1.2).
3.4 Design decision models versus k-MPP queries
The proposed product selection strategy is orthogonal to the design decision models studied in other domains [7, 16]. Like [7, 16], the k-MPP queries consider the product (i.e., the product attributes), the consumer (i.e., the customer), the firm (i.e., the manufacturer) and its competitors (i.e., competitors’ products in the market). The k-MPP queries tend to maximize the profit (i.e., modeled via the expected number of total customers) by selecting a subset of products from Q that can beat the competitors’ products by fulfilling the demand of the consumer. Unlike top-k queries [3] where weights are used to rank the objects, the demand of a product p is measured by the customers c that appear in RSL(p) and the alternative choices of a customer c are determined by the products p that appear in DSL(c).
4 Complexity analysis
This section analyzes the complexity of processing k-MPP query by providing a serial execution approach and then, a hypothetical parallel approach for improving the efficiency. The k-MPP is a special type of query which requires a different kind of data indexing scheme to expedite its execution time. We conclude that existing data indexing policies and parallel skyline computing techniques are inefficient for k-MPP queries.
4.1 Serial execution of k-MPP query
To process a k-MPP query, one can start by computing the dynamic skyline of each customer \(c\in C\) and then check whether DSL(c) contains the product \(q\in Q\). Once we know DSL(c) of each customer \(c\in C\), we can compute Pr(c, q|P) (or \(Pr(c, q|P\cup Q)\)) and E(C, q|P) (or \(E(C, q|P\cup Q)\)) for each product \(q\in Q\). However, this naive solution is not efficient as we need to compute |C| dynamic skylines (recall \(C\gg P\gg Q\)), and the dynamic skyline is itself computationally very expensive [18]. Moreover, not all products in Q may be readily available for computing DSL(c) offline to expedite the computation, e.g., some of the products may not be manufactured yet; rather, they are prototyped on the fly based on the demand received from the customers (e.g., a product survey), which requires online computation of DSL(c). A better alternative is to compute the reverse skyline of each query product \(q\in Q\) and then compute the dynamic skyline only for those customers that appear in the reverse skyline of q (avoiding unnecessary dynamic skyline computations) as suggested in Eq. 4. We term this the baseline approach. The efficiency of this solution depends on the number of query products in Q, the time required to compute the reverse skylines and the time required to compute the dynamic skylines of the customers c that appear in RSL(q) of \(q\in Q\).
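The baseline approach described above can be sketched as a two-step loop: a cheap reverse-skyline membership test first, and a dynamic skyline computation only for the customers that pass it. The code below is a brute-force illustration for the independent variant on hypothetical toy data; it uses no index and is not the paper's algorithm, and all names are assumptions.

```python
from fractions import Fraction
from typing import List, Sequence

Point = Sequence[float]


def dominates(a: Point, b: Point, ref: Point) -> bool:
    """Dynamic dominance of a over b w.r.t. ref (Definition 1)."""
    da = [abs(r - x) for r, x in zip(ref, a)]
    db = [abs(r - x) for r, x in zip(ref, b)]
    return all(x <= y for x, y in zip(da, db)) and any(x < y for x, y in zip(da, db))


def baseline_contribution(q: Point, P: List[Point], C: List[Point]) -> Fraction:
    """E(C, q | P) computed via RSL(q) first, then DSL(c) only where needed."""
    total = Fraction(0)
    for c in C:
        # Step 1: c is in RSL(q) only if no product in P dominates q w.r.t. c.
        if any(dominates(p, q, c) for p in P):
            continue  # q cannot be in DSL(c); skip the expensive skyline step
        # Step 2: size of the dynamic skyline of c over P plus q.
        pool = P + [q]
        size = sum(
            1 for i, x in enumerate(pool)
            if not any(dominates(o, x, c) for j, o in enumerate(pool) if j != i)
        )
        total += Fraction(1, size)
    return total


# Hypothetical toy data.
P = [(2, 2), (8, 8), (3, 5)]
C = [(1, 1), (9, 9), (5, 5)]
q = (4, 6)
print(baseline_contribution(q, P, C))  # 1/2
```

Here only the customer at (5, 5) survives the reverse-skyline test, and her skyline contains the query product and the product at (3, 5), so the query product earns a contribution of \(\frac{1}{2}\) while two of the three dynamic skyline computations are avoided.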
4.2 Parallel processing of k-MPP query
4.3 Limitations of existing approaches
As outlined in Sect. 4.1, the basic units of computation for processing k-MPP queries are dynamic and reverse skylines, and the efficiency of realizing a k-MPP query in a hypothetical ideal system depends heavily on parallelizing these units as conceptualized in Sect. 4.2. Though a number of established works on parallelizing the traditional skyline query exist in the literature (see [10, 17, 21, 24, 31, 32] for a survey), none of them is suitable for processing k-MPP queries in parallel (recall that k-MPP relies on dynamic and reverse skylines, not the traditional skyline query). To apply these techniques to our problem, we need to transform every object into a new space considering each query (\(q\in Q\) for reverse skyline) as well as each customer (\(c\in C\) for dynamic skyline) as the origin, which is certainly not efficient for solving k-MPP.
The only work on computing dynamic and reverse skylines in parallel is [19], which uses a quad-tree. However, the quad-tree index is query dependent and is designed to facilitate the computation of a single dynamic or reverse skyline query in parallel, not multiple skyline queries. There is also no technique for computing the bichromatic reverse skyline involving both competitors’ products and customer preferences using a quad-tree index. The index needs to be rebuilt every time we want to process a new dynamic or reverse skyline query. Therefore, the approach proposed in [19] cannot omit the data index-building time from \(T_{RSL}\) and \(T_{DSL}\) in Eq. 10 for processing a k-MPP query in parallel. To the best of our knowledge, there is also no work on grouping multiple skyline queries and processing them together by sharing their data space and computation.
5 Our approach
This section presents our approach for processing the k-MPP product selection queries in parallel. Firstly, we design a simple yet very efficient grid-based index structure, which is query independent, and therefore, the resultant index is reusable. Then, we show how to efficiently compute the basic units of a k-MPP query by reducing their search spaces. We also present an approach for grouping multiple query products \(q\in Q\) together based on their (extended) search spaces and thereby, expedite the processing of the k-MPP further.
5.1 Data indexing
We partition the whole data space into regular grids. Then, we index our dataset P and C by scanning them once. We also carefully select some of the products \(p\in P\) as pivots to establish the partitionwise dominances.
5.1.1 Index structure
Example 11
Consider the data objects given in Fig. 1. A \(3\times 3\) grid index structure of this dataset is shown in Fig. 6. In this index structure, we have 9 partitions: \(D_0, D_1, \ldots , D_8\) and the corresponding positional vectors of these partitions are: \(\langle 0,0\rangle , \langle 0,1\rangle , \ldots , \langle 2,2\rangle \), where \(\delta ^1=\frac{24}{3}=8\) and \(\delta ^2=\frac{24}{3}=8\). Here, we assume \(max(ob^i)=24,\forall i\in \{1,2,\ldots ,d\}\). The ranges of values covered in the \(1^{st}\) and \(2^{nd}\) dimensions of partition \(D_1\) are [0, 8) and [8, 16), respectively. Similarly, the ranges of values covered in the \(1^{st}\) and \(2^{nd}\) dimensions of \(D_8\) are [16, 24) and [16, 24), respectively. The location of \(p_2\) is \(\langle \lfloor \frac{6}{8}\rfloor , \lfloor \frac{18}{8}\rfloor \rangle =\langle 0,2\rangle \), i.e., partition \(D_2\).
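The location computation in Example 11 amounts to an integer division per dimension. The sketch below reproduces it; the function names are hypothetical, the row-major numbering of partitions is inferred from the example (\(\langle 0,1\rangle \) is \(D_1\), \(\langle 2,2\rangle \) is \(D_8\)), and clamping a value that sits exactly on the upper boundary into the last cell is an added assumption.

```python
from typing import Sequence, Tuple


def locate(ob: Sequence[float], delta: Sequence[float], n: int) -> Tuple[int, ...]:
    """Positional vector of the partition containing ob.

    Values equal to the maximum are clamped into the last cell (an assumption).
    """
    return tuple(min(int(v // s), n - 1) for v, s in zip(ob, delta))


def partition_id(pos: Sequence[int], n: int) -> int:
    """Row-major partition number l such that pos is the positional vector of D_l."""
    l = 0
    for i in pos:
        l = l * n + i
    return l


n, max_value = 3, 24
delta = (max_value / n, max_value / n)  # side lengths: 8 in each dimension

p2 = (6, 18)                            # coordinates of p_2 used in Example 11
pos = locate(p2, delta, n)
print(pos, partition_id(pos, n))        # (0, 2) 2
```

Since the mapping depends only on the grid size and the side lengths, it can be computed for any object without looking at a query, which is what makes the resulting index query independent.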
5.1.2 Index creation
To create the index, we scan the data objects in P and C sequentially and then, rewrite them in text file(s) as data objects separated by newlines as shown in Fig. 7a. Each data object is modeled as containing the following information: (a) the vector \(pos_l\) of the partition \(D_l\) containing the data object ob; (b) the id of ob and (c) values of the data object ob. We maintain two different text files, one for products P, denoted by “gridproducts.txt” and another one for customers C, denoted by “gridcustomers.txt.” The rationale is that we can scan the indexed products once and pass it to each skyline query if P fits entirely in memory (\(C\gg P\)). We can also scan objects from C page by page and process them one after another for computing the reverse skylines.
While scanning and rewriting the indexed objects, we save information about the partitions of the indexed data objects (e.g., the positional vector of the partition, #product objects and #customer objects in it). When we finish scanning objects from P and C, we write the information about the partitions and their pivot objects to a text file, denoted by “gridinfo.txt,” consisting of the following: (a) a grid header and (b) partition infos as shown in Fig. 7b. The grid header consists of the following information: (i) #dimensions; (ii) grid size (n) and (iii) the \(\delta \). The partition info(s) contain information about the non-empty partitions and consist of the following: (i) \(pos_l\) of the partition \(D_l\); (ii) #product objects; (iii) #customer objects; (iv) #pivot objects and (v) the values of the pivot objects. We do not save information about empty partitions, i.e., partitions that do not contain any type of data object. The index is built once and shared by all queries.
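The single-pass index creation described above can be illustrated with the following Python sketch. It keeps everything in memory instead of writing “gridproducts.txt,” “gridcustomers.txt” and “gridinfo.txt”; the space-separated record layout, the helper names `locate` and `build_index`, and the omission of pivot selection are assumptions of this sketch, since the exact file layout of Fig. 7 is not reproduced here.

```python
from collections import defaultdict


def locate(ob, delta, n):
    """Positional vector of the partition that contains ob."""
    return tuple(min(int(v // w), n - 1) for v, w in zip(ob, delta))


def build_index(products, customers, delta, n):
    """Scan P and C once; return indexed records and per-partition counts."""
    info = defaultdict(lambda: [0, 0])  # pos -> [#product objects, #customer objects]

    def index(objects, slot):
        lines = []
        for oid, values in objects.items():
            pos = locate(values, delta, n)
            info[pos][slot] += 1
            # Record layout (assumed): positional vector, object id, object values.
            lines.append(" ".join(map(str, (*pos, oid, *values))))
        return lines

    product_lines = index(products, 0)
    customer_lines = index(customers, 1)
    # Only non-empty partitions ever receive an entry in info.
    return product_lines, customer_lines, dict(info)


product_lines, customer_lines, info = build_index(
    {"p1": (6, 18)}, {"c1": (7, 20), "c2": (13, 13)}, (8, 8), 3
)
```

Because partitions are registered only when an object falls into them, empty partitions never appear in the partition information, which mirrors the rule stated above.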
5.1.3 Pivot object and partitionwise dominance
Definition 12
A partition \(D_1\) dominates another partition \(D_2\) w.r.t. a data object \(ob_3\), denoted by \(D_1\prec _{ob_3}D_2\), if every data object \(ob_2\in D_2\) is dominated by a data object \(ob_1\in D_1\) w.r.t. \(ob_3\), i.e., \(ob_1\prec _{ob_3}ob_2\). If \(ob_1\) is a product object from P, we say \(D_1\) productwise dominates \(D_2\) and is denoted by \(D_1\prec ^P_{ob_3}D_2\). We term the product object \(ob_1\) as the pivot object of \(D_1\).
Example 12
Consider the datasets given in Fig. 1 and the query product \(q_1(12, 12)\) as shown in Fig. 6. According to Definition 1, every object in \(D_2\) is dominated by \(p_4\in D_4\) w.r.t. \(q_1\). Therefore, \(D_2\) is dominated by \(D_4\) w.r.t \(q_1\), i.e., \(D_4\prec _{q_1} D_2\). Here, \(p_4\) is the pivot object for \(D_4\) which is used to establish \(D_4\prec _{q_1} D_2\) and \(D_4\prec ^P_{q_1} D_2\). Similarly, \(p_4\) can also be used to establish the following relationships: \(D_4\prec _{c_5} D_2\) and \(D_4\prec ^P_{c_5} D_2\).
To establish the partitionwise dominance between partitions \(D_1\) and \(D_2\) w.r.t. \(ob_3\), we do not need to check the pairwise dominance between the data object \(ob_1\in D_1\) and every data object \(ob_2\in D_2\). We know that the data objects of a partition \(D_2\) are bounded by its hypothetical corner objects (as marked by the red circles in Fig. 6). In a d-dimensional data space, a partition has \(2^d\) corner objects. The values of these corner objects of \(D_2\) in the \(i\mathrm{th}\) dimension are: \(<pos_2^i\times \delta ^i+b\times \delta ^i>, \forall b\in \{0,1\}\). We only need to check the pairwise dominance of \(ob_1\) with these hypothetical corner objects of \(D_2\) w.r.t. \(ob_3\). If \(ob_1\in D_1\) dominates all of these corner objects of \(D_2\) w.r.t. \(ob_3\), then all data objects \(ob_2\in D_2\) are dominated by \(ob_1\in D_1\) w.r.t. \(ob_3\). Therefore, we save \(ob_1\) as a pivot object for \(D_1\) in “gridinfo.txt” to establish \(D_1\prec _{ob_3} D_2\).
Example 13
Consider the datasets given in Fig. 1, the index structure and the query product \(q_1(12, 12)\) as shown in Fig. 6. The hypothetical corner objects of \(D_2\) are marked as red circles in Fig. 6. It is easy to verify that all of these corner objects of \(D_2\) are dominated by \(p_4\in D_4\) w.r.t. \(q_1\). We get \(D_4\prec _{q_1} D_2\) and also, \(D_4\prec ^P_{q_1} D_2\). Therefore, \(p_4\) is stored as pivot object for \(D_4\) in “gridinfo.txt” to establish the partitionwise dominance with \(D_2\) in D w.r.t. \(q_1\).
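The corner-object check can be sketched in Python as follows. The sketch assumes the usual form of dynamic dominance (componentwise comparison of absolute distances to the reference object, with at least one strict inequality) for Definition 1, and it assumes, as in Example 13, that the tested partition does not straddle the reference object in any dimension. The pivot coordinates (11, 13) are hypothetical, because the coordinates of \(p_4\) are given only in Fig. 1.

```python
from itertools import product


def dyn_dominates(a, b, ref):
    """True if a dynamically dominates b w.r.t. ref (assumed form of Definition 1)."""
    da = [abs(x - r) for x, r in zip(a, ref)]
    db = [abs(x - r) for x, r in zip(b, ref)]
    return all(x <= y for x, y in zip(da, db)) and any(x < y for x, y in zip(da, db))


def corners(pos, delta):
    """The 2^d hypothetical corner objects of the partition at position pos."""
    return [
        tuple((p + b) * w for p, b, w in zip(pos, bits, delta))
        for bits in product((0, 1), repeat=len(pos))
    ]


def pivot_dominates_partition(pivot, pos, delta, ref):
    """Corner check: the pivot must dominate every corner object w.r.t. ref."""
    return all(dyn_dominates(pivot, c, ref) for c in corners(pos, delta))


q1 = (12, 12)
hypothetical_pivot = (11, 13)  # stands in for p_4, whose coordinates are in Fig. 1
d2_pruned = pivot_dominates_partition(hypothetical_pivot, (0, 2), (8, 8), q1)
```

Only \(2^d\) comparisons are needed per pair of pivot and partition, independent of how many objects the partition holds.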
Lemma 1
If a partition \(D_1\) dominates another partition \(D_2\) w.r.t. a data object ob and \(D_2\) dominates \(D_3\) w.r.t. ob too, then \(D_1\) also dominates \(D_3\) w.r.t. ob, i.e., \(D_1\prec _{ob}D_2\) and \(D_2\prec _{ob}D_3\implies D_1\prec _{ob}D_3\).
Proof
Let us assume that there are three data objects \(ob_1\in D_1, ob_2\in D_2\) and \(ob_3\in D_3\). The proof of this lemma then immediately follows the transitivity: \(ob_1\prec _{ob} ob_2\text { and } ob_2\prec _{ob} ob_3\implies ob_1\prec _{ob} ob_3\), and the partitionwise dominance given in Definition 12. \(\square \)
5.2 Processing the basic computational units
5.2.1 Computing dynamic skyline of a customer
We already know that a dynamic skyline of a customer c retrieves all products \(p_1\in P\) that are not dynamically dominated by other products \(p_2\in P\) w.r.t. the customer c as explained in Sect. 2.
Lemma 2
We can safely remove every product \(p_2\) contained in a partition \(D_2\) for processing DSL(c) iff \(\exists D_1\in D: D_1\prec ^P_c D_2\).
Proof
From Definition 12, we know that \(\exists p_1\in D_1\) such that the products \(p_2\in D_2\) are dominated by \(p_1\) w.r.t. the customer c as \(D_1\prec ^P_c D_2\). Therefore, \(p_2\in D_2\) cannot appear in the dynamic skyline of the customer c and can be removed safely while computing DSL(c). \(\square \)
The search space \(\mathcal {N}_c(D)\) of a dynamic skyline query c consists of all non-dominated partitions in D (according to Lemma 2 and Lemma 1), which can be determined by accessing the information stored in “gridinfo.txt.” As we are going to retrieve the products \(p_1\in P\) that c prefers over the other products \(p_2\in P\), we need to access only the product data stored in “gridproducts.txt.” The steps of computing DSL(c) are described below:
(1) Computing \(\mathcal {N}_c(D)\) for DSL(c): We initialize a FIFO list \(\mathcal {L}\) by the location of the customer c, i.e., loc(c). Then, we do the following until \(\mathcal {L}\) is empty: (a) pop an element \(D_1\) from \(\mathcal {L}\) and add it to \(\mathcal {N}_c(D)\); (b) compute the immediate neighbors \({\mathcal {N}}_1(D_1)\) of \(D_1\) as follows: \(<pos_1^i+b>, \forall b\in \{-1, 0, +1\} \text{ and } i\in \{1,2,...,d\}\); and (c) \(\forall D_2\in {\mathcal {N}}_1(D_1)\), if \(D_2\) is not dominated by any partition in \(\mathcal {N}_c(D)\) (Lemma 2) w.r.t. c, then add \(D_2\) to \(\mathcal {N}_c(D)\) and also, insert \(D_2\) into \(\mathcal {L}\). We add \(D_2\) to \(\mathcal {L}\) because we cannot guarantee that the immediate neighbors of \(D_2\) will be dominated either by \(D_1\) or \(D_2\), e.g., neighboring partitions of \(D_2\) that are positioned in the vertical and horizontal directions. The above steps are pseudo-coded in Algorithm 1.
(2) Retrieving Products from P: We access the products p stored in “gridproducts.txt” sequentially and filter them as follows: if \(loc(p)\in \mathcal {N}_c(D)\), then we insert p into a min heap \(\mathcal {H}_c\), otherwise drop it. To compare two objects for the min heap \(\mathcal {H}_c\), we use the Euclidean distances of products \(p_1\) and \(p_2\) to c. We assume that the min heap \(\mathcal {H}_c\) can be stored in the main memory (recall \(P\ll C\)). These steps are pseudo-coded in lines 4–6 of Algorithm 2.
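Step (1) can be sketched as a breadth-first expansion over the grid. The following Python code is an illustration under stated assumptions rather than a transcription of Algorithm 1: the function names are hypothetical, `pivots` is an assumed in-memory mapping from positional vectors to pivot objects (as would be read from “gridinfo.txt”), and the pruning test compares each pivot with the nearest possible point of a partition, a conservative variant of the corner check that never prunes a partition straddling c.

```python
from collections import deque
from itertools import product


def locate(ob, delta, n):
    return tuple(min(int(v // w), n - 1) for v, w in zip(ob, delta))


def pivot_prunes(pivot, pos, delta, c):
    """Conservative test: pivot dominates every point of partition pos w.r.t. c."""
    strict = False
    for pv, p, w, cv in zip(pivot, pos, delta, c):
        lo, hi = p * w, (p + 1) * w
        # Smallest possible distance to c of any point of the partition in this dimension.
        nearest = 0 if lo <= cv <= hi else min(abs(lo - cv), abs(hi - cv))
        gap = abs(pv - cv)
        if gap > nearest:
            return False
        strict = strict or gap < nearest
    return strict


def search_space(c, delta, n, pivots):
    """Grow N_c(D) from loc(c) by adding non-dominated neighboring partitions."""
    start = locate(c, delta, n)
    space, queue = {start}, deque([start])
    while queue:
        d1 = queue.popleft()
        for step in product((-1, 0, 1), repeat=len(c)):
            d2 = tuple(p + s for p, s in zip(d1, step))
            if d2 in space or not all(0 <= p < n for p in d2):
                continue
            known = [pv for part in space for pv in pivots.get(part, ())]
            if not any(pivot_prunes(pv, d2, delta, c) for pv in known):
                space.add(d2)
                queue.append(d2)
    return space


# Hypothetical pivot (11, 13) stored for the partition at <1,1>.
space = search_space((12, 12), (8, 8), 3, {(1, 1): [(11, 13)]})
```

In this small setting the four diagonal partitions are pruned, while the partitions in the vertical and horizontal directions of loc(c) remain in the search space, which matches the remark in step (1).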
Correctness of Algorithm 2. Lines 2–10 of Algorithm 1 compute the search space, i.e., the set of non-dominated partitions \(\mathcal {N}_c(D)\) for DSL(c), starting from loc(c). The algorithm grows \(\mathcal {N}_c(D)\) by repeatedly adding the non-dominated neighboring partitions to it. It stops only when all remaining neighboring partitions are dominated by some of the partitions already added to \(\mathcal {N}_c(D)\). Therefore, \(\mathcal {N}_c(D)\) consists of the legitimate partitions that may contain the dynamic skyline objects for DSL(c). Lines 4–6 of Algorithm 2 scan the product objects from “gridproducts.txt” and insert them into a min heap \(\mathcal {H}_c\) in order of their distances to c. Lines 7–11 of Algorithm 2 compute DSL(c) by repeatedly retrieving the root product from \(\mathcal {H}_c\). The min heap \(\mathcal {H}_c\) ensures that the root product is either dominated by some product already added to DSL(c) or is part of it [18]. Therefore, we correctly compute the dynamic skyline of a customer \(c\in C\), i.e., DSL(c).
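The heap-based part of this argument can be illustrated with a short Python sketch. It assumes the usual form of dynamic dominance for Definition 1 and takes the already filtered products as an in-memory dictionary; the function names and the sample coordinates are hypothetical.

```python
import heapq
from math import dist


def dyn_dominates(a, b, ref):
    """True if a dynamically dominates b w.r.t. ref (assumed form of Definition 1)."""
    da = [abs(x - r) for x, r in zip(a, ref)]
    db = [abs(x - r) for x, r in zip(b, ref)]
    return all(x <= y for x, y in zip(da, db)) and any(x < y for x, y in zip(da, db))


def dynamic_skyline(c, products):
    """products: id -> values, already restricted to the search space of c."""
    heap = [(dist(values, c), pid) for pid, values in products.items()]
    heapq.heapify(heap)
    skyline = []
    while heap:
        _, pid = heapq.heappop(heap)
        # A dominator is always strictly closer to c, so it was popped earlier.
        if not any(dyn_dominates(products[s], products[pid], c) for s in skyline):
            skyline.append(pid)
    return skyline


dsl = dynamic_skyline(
    (12, 12), {"a": (11, 13), "b": (6, 18), "d": (12, 20), "e": (20, 12)}
)
```

Popping in increasing Euclidean distance is what makes a single comparison against the current skyline sufficient: if \(p_1\) dominates \(p_2\) w.r.t. c, then \(p_1\) is strictly closer to c and has therefore already been examined.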
5.2.2 Computing the reverse skyline of a product
A reverse skyline query of a query product q, RSL(q), consists of all customers \(c\in C\) that prefer to buy q over other products \(p\in P\). The RSL(q) is computed by retrieving all customers \(c\in C\) that are not dynamically dominated by the midskylines of products \(p\in P\) w.r.t. q in each of its orthants, i.e., \(m\not \prec _q c\), where m is a midskyline point of P as explained in Sect. 2.
Observation-1: If a product p dominates a customer c w.r.t. a query product q in an orthant O, then the midpoint m of p also dominates c w.r.t. q. Therefore, the customer c cannot be in the reverse skyline of q if \(p\prec _{q} c\) in an orthant O.
Example 14
Consider the datasets given in Fig. 1, the index structure, the query product \(q_1(12, 12)\), customer \(c_3\in C\) and the midpoint \(m_4\) of product \(p_4\in P\) w.r.t. \(q_1\) as shown in Fig. 8. It is easy to verify that \(c_3\in C\) is dominated by \(m_4\) w.r.t. \(q_1\) in orthant \(O_4\) as \(p_4\prec _{q_1} c_3\). Therefore, the customer \(c_3\) cannot be in \(RSL(q_1)\). Similarly, any customer in \(D_2\) dominated by \(p_2\) or \(p_3\) w.r.t. \(q_1\) would also be dominated by \(m_4\) w.r.t. \(q_1\) and therefore, could not be in \(RSL(q_1)\).
Lemma 3
We can safely remove each product \(p_2\) and customer \(c_2\) contained in \(D_2\) for processing RSL(q) iff (1) \(\exists \text{ pivot } p\in D_1\) such that the pivot p and the partition \(D_2\) are in the same orthant of q and (2) \(D_1\prec ^P_q D_2\).
Proof
The proof of this lemma follows immediately from Observations 1 and 2 and from the definition of partitionwise dominance given in Definition 12. \(\square \)
The search space \(\mathcal {N}_q(D)\) of a reverse skyline query q consists of all non-dominating partitions (according to Lemma 3 and Lemma 1), which can be determined by accessing the information stored in “gridinfo.txt” only. However, as we are going to retrieve customers that find q preferable to other products in P, we need to access both product data and customer data stored in “gridproducts.txt” and “gridcustomers.txt,” respectively.
The steps of computing RSL(q) are as follows:
(1) Computing \(\mathcal {N}_q(D)\) for RSL(q): We initialize a FIFO list \(\mathcal {L}\) by the location of the query q, i.e., loc(q). Then, we do the following until \(\mathcal {L}\) is empty: (a) pop an element \(D_1\) from \(\mathcal {L}\) and add it to \(\mathcal {N}_q(D)\); (b) compute the immediate neighbors \(N_1(D_1)\) of \(D_1\) as follows: \(<pos_1^i+b>, \forall b\in \{-1, 0, +1\} \text{ and } i\in \{1,2,...,d\}\); and (c) \(\forall D_2\in N_1(D_1)\), if \(D_2\) is not dominated by any partition in \(\mathcal {N}_q(D)\) (Lemma 3), then add \(D_2\) to \(\mathcal {N}_q(D)\) and also, insert \(D_2\) into \(\mathcal {L}\). These steps are pseudo-coded in lines 2–10 of Algorithm 3.
(2) Retrieving Products from P: We access the product objects p stored in “gridproducts.txt” sequentially and filter them as follows: if \(loc(p)\in \mathcal {N}_q(D)\), then we insert p into a min heap \(\mathcal {H}_q\), otherwise drop it. To compare two objects for the min heap \(\mathcal {H}_q\), we use the Euclidean distances of products \(p_1\) and \(p_2\) to q. We assume that the min heap \(\mathcal {H}_q\) can be stored in the main memory (recall \(P\ll C\)). These steps are pseudo-coded in lines 2–4 of Algorithm 4.
(3) Computing Midpoint Skyline of P: Firstly, we initialize the midpoint skyline set \(\mathcal {M}\) to \(\emptyset \). Then, we do the following until \(\mathcal {H}_q\) is empty: (a) retrieve the root product \(p_1\) from \(\mathcal {H}_q\); and (b) add the midpoint \(m_1\) of \(p_1\) to \(\mathcal {M}\) if \(\not \exists m_2\in \mathcal {M}: m_2\prec _q m_1\) and \(m_1\) and \(m_2\) are in the same orthant O of the query product q (Observation-2). These steps are pseudo-coded in lines 5–10 of Algorithm 4.
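Steps (2) and (3), followed by the customer filter described at the start of this subsection, can be sketched in Python as follows. The sketch assumes that the midpoint of p w.r.t. q is \((p+q)/2\), that the orthant of a point is determined by the componentwise test \(x^i>q^i\), and the usual form of dynamic dominance; the function names and sample coordinates are hypothetical, and the search-space filtering of step (1) is omitted.

```python
import heapq
from math import dist


def dyn_dominates(a, b, ref):
    da = [abs(x - r) for x, r in zip(a, ref)]
    db = [abs(x - r) for x, r in zip(b, ref)]
    return all(x <= y for x, y in zip(da, db)) and any(x < y for x, y in zip(da, db))


def orthant(x, q):
    """Orthant of x w.r.t. q, encoded as one boolean per dimension."""
    return tuple(xi > qi for xi, qi in zip(x, q))


def midpoint_skyline(q, products):
    """Midpoints of products that no same-orthant midpoint dominates w.r.t. q."""
    heap = [(dist(p, q), i, p) for i, p in enumerate(products)]
    heapq.heapify(heap)
    mids = []
    while heap:
        _, _, p = heapq.heappop(heap)
        m = tuple((pi + qi) / 2 for pi, qi in zip(p, q))
        if not any(orthant(m2, q) == orthant(m, q) and dyn_dominates(m2, m, q)
                   for m2 in mids):
            mids.append(m)
    return mids


def reverse_skyline(q, products, customers):
    """Customers not dominated w.r.t. q by a midskyline point of their own orthant."""
    mids = midpoint_skyline(q, products)
    return [c for c in customers
            if not any(orthant(m, q) == orthant(c, q) and dyn_dominates(m, c, q)
                       for m in mids)]


q = (12, 12)
P = [(8, 8), (4, 4), (20, 8)]
C = [(9, 9), (11, 11), (13, 13), (18, 9)]
mids = midpoint_skyline(q, P)
rsl = reverse_skyline(q, P, C)
```

In this example the customer at (9, 9) is dropped because the midpoint (10, 10) of the product (8, 8) dominates it w.r.t. q in the same orthant, in line with Observation-1.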
5.2.3 Optimizing partitionwise dominance for RSL
5.3 Processing k-MPP query
To process a k-MPP query, we need to do the following: (a) computing the reverse skyline for each query product \(q\in Q\); (b) computing the dynamic skyline of each customer \(c\in RSL(q)\); and (c) selecting the k query products from Q based on their contribution in the market. However, the processing of dynamic skyline of a customer \(c\in RSL(q)\) varies for \(k-MPP_{ind}\) and \(k-MPP_{dep}\). To compute DSL(c) and RSL(q) for \(k-MPP_{dep}\) query, we use \(P\cup Q\setminus q\) as P.
5.3.1 A straightforward strategy
5.3.2 An optimized strategy
This section describes an optimized strategy for computing the k-MPP product selection query in parallel based on: (1) grouping similar query products in Q and processing their reverse skylines together; and (2) computing DSL(c) offline and updating it for the query products \(q\in Q\). Our approach is described below.
A) Grouping similar query products We propose to group queries in Q based on their similarities to achieve the following: (1) the reverse skylines of the similar query products can be processed together by a single working node and (2) the disk accesses can be reduced by sharing the data objects among the basic computational units while processing the k-MPP query.
Definition 13
(Strong similarity) Two query products \(q_1\) and \(q_2\) in Q are said to be strongly similar iff \(\mathcal {N}_{q_1}(D)=\mathcal {N}_{q_2}(D)\).
Lemma 4
Two query products \(q_1\) and \(q_2\) share the same extended search space, i.e., \(\mathcal {N}_{q_1}^+(D)=\mathcal {N}_{q_2}^+(D)\), if (1) \(loc(q_1)=loc(q_2)\) (rename them as \(loc(q_{12})\)); and (2) \(\forall p\in loc(q_{12}), O_{q_1}(p)=O_{q_2}(p)\) in \(loc(q_{12})\), where \(p\in P\) is a pivot for \(loc(q_{12})\).
Proof
Assume that the extended search spaces \(\mathcal {N}_{q_1}^+(D)\) and \(\mathcal {N}_{q_2}^+(D)\) are not the same but the conditions (1) and (2) in Lemma 4 hold for \(q_1\) and \(q_2\). Also, assume that a partition \(D_1\) does not appear in \(\mathcal {N}_{q_1}^+(D)\) but does in \(\mathcal {N}_{q_2}^+(D)\). Since \(D_1\not \in \mathcal {N}_{q_1}^+(D)\), there must be a pivot product \(p_1\) in \(loc(q_1)\) such that (1) \(O_{q_1}(p_1)=O_{q_1}(D_1)\) and (2) \(p_1\prec _{q_1}D_1\). However, since \(loc(q_1)=loc(q_2)\) and \(p_1\) appears in the same orthant for both \(q_1\) and \(q_2\), \(p_1\) must also dominate \(D_1\) w.r.t. \(q_2\), i.e., \(p_1\prec _{q_2} D_1\). Hence \(D_1\not \in \mathcal {N}_{q_2}^+(D)\), which contradicts our assumption. \(\square \)
Definition 14
(Weak similarity) Two queries \(q_1\) and \(q_2\) in the query product set Q are said to be weakly similar if (1) \(loc(q_1)=loc(q_2)\); and (2) \(\mathcal {N}_{q_1}^+(D)=\mathcal {N}_{q_2}^+(D)\).
Example 15
Consider the datasets given in Fig. 1, the index structure and the queries as shown in Fig. 10. The location of \(q_1\) and \(q_3\) is \(D_4\). Assume that \(p_4\) is a pivot object for \(D_4\). It is easy to verify that \(p_4\) is in the same orthant for both \(q_1\) and \(q_3\), i.e., (1) if \(p_4^i\le q_1^i\), then \(p_4^i\le q_3^i\) and (2) if \(p_4^i>q_1^i\), then \(p_4^i>q_3^i\). Therefore, the (extended) search spaces of the reverse skylines for \(q_1\) and \(q_3\) are the same, i.e., \(\{D_0, D_1, D_3, D_4, D_5, D_6, D_7, D_8\}\). We say \(q_1\) and \(q_3\) are similar.
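The two conditions of Lemma 4 suggest a simple grouping key: the location of the query product together with the orthants of the pivots of that location w.r.t. the query product. The Python sketch below illustrates this idea; the function name `group_queries`, the in-memory `pivots` mapping and all coordinates are hypothetical, and the sketch is not a transcription of the grouping step of Algorithm 8.

```python
from collections import defaultdict


def group_queries(queries, delta, n, pivots):
    """Group query products that satisfy conditions (1) and (2) of Lemma 4."""
    groups = defaultdict(list)
    for qid, q in queries.items():
        # Condition (1): same location in the grid.
        loc = tuple(min(int(v // w), n - 1) for v, w in zip(q, delta))
        # Condition (2): every pivot of that location lies in the same orthant of q.
        signature = tuple(
            tuple(pi > qi for pi, qi in zip(p, q)) for p in pivots.get(loc, ())
        )
        groups[(loc, signature)].append(qid)
    return list(groups.values())


queries = {"q1": (12, 12), "q2": (10, 14), "q3": (13, 12.5), "q4": (20, 20)}
groups = group_queries(queries, (8, 8), 3, {(1, 1): [(11, 13)]})
```

Here the hypothetical pivot (11, 13) lies in the same orthant of \(q_1\) and \(q_3\) but in a different orthant of \(q_2\), so only \(q_1\) and \(q_3\) end up in the same group although all three are located in the same partition.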
B) Computing DSL(c) offline and updating for Q To compute k-MPP, we need to compute the dynamic skyline for each customer \(c\in RSL(q)\). The dynamic skyline is known to be computationally very expensive [18]. To expedite this operation, we precompute the dynamic skyline of each customer \(c\in C\) considering only the existing products P. Then, we update this precomputed dynamic skyline for the query product set Q. The precomputed dynamic skylines are stored as fixed-length records in a text file “dynamicskylines.txt.” Each record in the file consists of the comma-separated identifiers of the products \(p\in P\) that appear in the dynamic skyline of the customer \(c\in C\) (padded with spaces at the end if needed). The first record represents the dynamic skyline of the first customer \(c_1\) in C and so on.
The pseudo-code for efficiently computing the k-MPP query in parallel considering the above is given in Algorithm 8. Firstly, the algorithm groups queries based on their predicted \(\mathcal {N}_q^+(D)\) (line 4) computed by the master node. Then, for each group of queries \(Q_1\), a worker node is assigned by the master node to do the following: if \(|Q_1|=1\), process it by calling Algorithm 5, otherwise: (a) compute the extended search space \(\mathcal {N}_q^+(D)\) and the filtered products \(P^\prime \subseteq P\) and (b) compute RSL(q) for each \(q\in Q_1\) as follows: (i) compute the actual search space \(\mathcal {N}_q(D)\) by refining \(\mathcal {N}_q^+(D)\) for q, (ii) compute the midpoint skyline \(\mathcal {M}(q) \) based on \(\mathcal {N}_q(D)\) and \(P^\prime \) and (iii) finally, compute RSL(q) based on \(\mathcal {N}_q(D), \mathcal {M}(q)\) and C (lines 11–17). Lines 18–23 compute the dynamic skyline for each \(c\in RSL(q)\) by retrieving the precomputed DSL(c) and then updating it for Q. Finally, line 24 selects the k best query products in Q based on E(C, q|P).
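Fixed-length records allow the dynamic skyline of the i-th customer to be fetched with a single seek instead of a scan. The Python sketch below illustrates this access pattern on an in-memory buffer; the record length of 32 bytes, the absence of line separators and the function names are assumptions of this sketch, since the actual layout of “dynamicskylines.txt” is not specified beyond the description above.

```python
import io

RECORD_LEN = 32  # hypothetical record length in bytes


def write_records(f, skylines):
    """Write one space-padded, comma-separated record per customer."""
    for ids in skylines:
        record = ",".join(ids)
        if len(record) > RECORD_LEN:
            raise ValueError("record does not fit into RECORD_LEN")
        f.write(record.ljust(RECORD_LEN).encode("ascii"))


def read_record(f, index):
    """Fetch the precomputed dynamic skyline of the customer at position index."""
    f.seek(index * RECORD_LEN)
    raw = f.read(RECORD_LEN).decode("ascii").rstrip()
    return raw.split(",") if raw else []


buf = io.BytesIO()
write_records(buf, [["p1", "p4"], [], ["p2"]])
third = read_record(buf, 2)
```

The same two functions work unchanged on a file opened in binary mode, because only `seek`, `read` and `write` are used.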
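The master/worker structure of this procedure can be illustrated with a thread pool. The Python sketch below shows only the dispatching and the final selection; it is not a transcription of Algorithm 8. The callback `score_group` is a hypothetical placeholder for the per-group work (reverse skylines, dynamic skyline updates and the computation of E(C, q|P)), and the scores used in the example are made up.

```python
import heapq
from concurrent.futures import ThreadPoolExecutor


def top_k_products(groups, score_group, k, workers=4):
    """Process each group of similar query products on a worker, then pick the k best."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partial_scores = list(pool.map(score_group, groups))
    scores = {}
    for partial in partial_scores:
        scores.update(partial)
    # Select the k query products with the largest score.
    return heapq.nlargest(k, scores, key=scores.get)


# Made-up scores standing in for E(C, q|P).
fake_scores = {"q1": 3.0, "q2": 5.0, "q3": 1.0}
best = top_k_products(
    [["q1", "q3"], ["q2"]],
    lambda group: {q: fake_scores[q] for q in group},
    k=2,
)
```

Assigning a whole group to one worker is what lets the members of the group share the extended search space and the filtered products; it also means that one very large group can dominate the running time.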
6 Experiments
This section presents the experimental studies of our approach. More specifically, we show the effectiveness of our product selection model as well as evaluate the performance of processing k-MPP query in parallel by comparing our approach with the existing counterparts.
6.1 Settings
Datasets We empirically evaluate the performance of our proposed technique for processing k-MPP query in parallel using real data, namely CarDB^{1}, consisting of \(2\times 10^5\) objects. This is a six-dimensional dataset with attributes: make, model, year, price, mileage and location. The three numerical attributes year, price and mileage are considered in our experiments. The dataset is normalized into the range [0, 1]. We randomly select half of these normalized car objects as products and the rest as the customers. The use of the CarDB dataset makes excellent sense in our experiment. The k-MPP can be exploited to estimate the market contribution of the advertised cars for a particular seller and thereby, select the k-most promising cars for designing specialized promotions. We also present experimental results based on synthetic data: uniform (UN), correlated (CO) and anti-correlated (AC), consisting of varying number of products, customers and dimensions. The cardinalities of these datasets in products and customers are \(1\times 10^5, 2\times 10^5, 3\times 10^5, 4\times 10^5\) and \(5\times 10^5\). The dimensionality is in the range of 2–4. Examples of these datasets consisting of \(2\times 10^5\) objects in two dimensions are shown in Fig. 11a–d.
Queries For each experiment we run a number of queries generated (synthetic) and selected (CarDB) randomly by following the distribution of the tested datasets.
6.2 Effectiveness study
We know that the market contribution measure effectively combines both the customer and the manufacturer perspective into the same metric, which is (theoretically) more realistic than the influence score measure as demonstrated in Sects. 3.1 and 3.4. This section evaluates how effectively our approach trades off between the valued customers and the influence score. Recall that the valued customers are those customers in the market that are prone to prefer the manufacturer’s own products as explained in Sect. 3.1.
6.3 Data indexing evaluation
6.4 Performance study
This section studies the execution efficiency of processing k-MPP in parallel based on our index structure. More specifically, we study the effect of different parameter settings on the efficiency of the proposed approaches SROND (the Baseline), SROFD (a variation of the Baseline), PROND (the straightforward strategy) and PROFD (a variation of the straightforward strategy) as summarized in Table 3. We also compare the efficiency of our approach with the existing approach of computing dynamic and reverse skylines in parallel using quad-tree (QTree) indexing scheme [19]. As our k-MPP query involves both products and customer preferences, we extend the mono-chromatic reverse skyline computing technique given in [19] for bichromatic reverse skyline using the QTree index as follows: (a) a QTree node \(n_1\) is marked as pruned for a \(q\in Q\) if there exists a QTree node \(n_2\) such that \(n_1\) and \(n_2\) are in the same orthant of q and \(p_2\in n_2\) dominates all corner points of \(n_1\) w.r.t q; (b) all products \(p\in P\) are scanned and used to compute midpoint skyline for q if not located in the pruned node and (c) finally, all \(c\in C\) are scanned and marked as plausible customer for q (i.e., reverse skyline point) if not located in the pruned node of the QTree and not dominated by any midskyline point. We utilize 1000 samples (selected by applying the reservoir sampling method) and split threshold 40 to build the QTree for best performance as suggested in [19]. We compare two variants of this approach, i.e., QPROND and QPROFD as summarized in Table 3.
6.4.1 Effect of dimensionality
We study the effect of data dimensions on our proposed algorithms by setting the number of worker nodes (number of threads in the thread pool of Java) to 10. We run experiments on 2–3-dimensional real CarDB datasets and 2–4-dimensional synthetic datasets. The cardinality is set to \(1\times 10^5\) for both products and customers. To index each tested dataset, the grid size (n) is established empirically. The value of k is set to 20 for k-MPP\(_{dep}\). We run 100 queries following the distribution of the tested dataset. The results are shown in Fig. 15. We see that the execution time of every approach generally increases with the number of dimensions. However, the proposed parallel algorithms PROFD and PROND take far less time than the baseline approaches, i.e., SROFD and SROND, and significantly outperform the existing counterparts, i.e., QPROFD and QPROND, as the number of dimensions in the datasets increases. The existing QPROFD and QPROND perform worse than our approach because of their query-dependent data indices, as per our analysis given in Sect. 4.3. The existing counterpart QPROND is not scalable in higher dimensions as shown in Fig. 15b. These results demonstrate that the baseline approaches as well as the existing counterparts are not suitable for processing k-MPP queries. We conclude that the proposed reusable, query-independent grid-based data indexing and our processing approach are efficient for processing k-MPP queries in parallel.
6.4.2 Effect of query products
This section studies the effect of the cardinality of the query products on the execution time of our approaches. We run experiments on CarDB and CO datasets in two dimensions by varying the number of query products, |Q|, from 20 to 500, setting \(|P|=1\times 10^5, |C|=1\times 10^5, k\) to 25 for k-MPP\(_{dep}\) and the number of worker nodes (threads in Java) to 20. The grid size (n) is established empirically. The results are shown in Fig. 16. We see that the execution time of every technique increases as the number of query products in Q increases. However, our approaches PROFD and PROND are much faster than the existing counterparts QPROFD and QPROND. The existing counterpart QPROND is not scalable in terms of |Q| due to its query-dependent quad-tree index. On the other hand, our approaches are highly scalable in terms of |Q|.
6.4.3 Effect of cardinality in datasets
Table 3 Different approaches of processing k-MPP
Acronym | Description |
---|---|
SROND | Serial RSL + ONline DSL |
SROFD | Serial RSL + OFfline DSL |
PROND | Parallel RSL + ONline DSL |
PROFD | Parallel RSL + OFfline DSL |
QPROND | Parallel RSL + ONline DSL using QTree [19] |
QPROFD | Parallel RSL + OFfline DSL using QTree [19] |
PMROND | Parallel Multiple RSL + ONline DSL |
PMROFD | Parallel Multiple RSL + OFfline DSL |
6.4.4 Effect of threads
6.4.5 Effect of data indexing
The efficiency (millisecs) of k-MPP variants
Dataset | PROFD: k-MPP\(_{ind}\) | PROFD: k-MPP\(_{dep}\) | PROND: k-MPP\(_{ind}\) | PROND: k-MPP\(_{dep}\) |
---|---|---|---|---|
CarDB | 6689 | 6192 | 12,017 | 12,093 |
UN | 2772 | 2847 | 11,823 | 11,661 |
CO | 4775 | 4734 | 13,360 | 13,302 |
AC | 3860 | 3720 | 10,484 | 10,471 |
6.4.6 The k-MPP\(_{ind}\) versus k-MPP\(_{dep}\) query
6.4.7 Evaluation of optimized strategy
This section evaluates the efficiency of the optimized strategy for processing k-MPP queries in parallel. More specifically, we compare the performances of the following approaches: PMROND (a variation of the optimized strategy) and PMROFD (the optimized strategy) as summarized in Table 3. We create a number of clustered queries (to increase the similarities of their search spaces) by setting \(|Q|=100\) for UN and \(|Q|=200\) for real CarDB datasets in two dimensions and varying the grid size from 4 to 40. However, the distributions of \(q\in Q\) in the resultant clusters (i.e., group of similar queries) are not uniform as shown in Fig. 21. We conduct two experiments to evaluate the performance of the optimized strategy as given below.
Effect of data indexing We study the effect of grid size (n) on the efficiency of the optimized strategy by setting k to 25 for k-MPP\(_{dep}, |P|=1\times 10^5, |C|=1\times 10^5\) and the number of threads to 20. The results are shown in Fig. 22. We see that the optimized strategies PMROND and PMROFD take less time than the straightforward strategies PROND and PROFD for lower grid sizes (n), except when the distributions of \(q\in Q\) in the clusters are too skewed. For example, the first few clusters contain most of the query products \(q\in Q\) in CarDB for \(n=8\) as shown in Fig. 21b, and the optimized strategies perform worst in this case (i.e., the responsible thread is heavily loaded). However, the optimized strategy tends to perform similarly to the straightforward strategy for increased grid sizes (n), as we can no longer form clusters of similar queries that share their search spaces and computations.
Effect of threads We study the effect of the number of threads in Java on the efficiency of the optimized strategy, again in UN and CarDB datasets, by setting the grid size (n) for them to 8 and 25, respectively. We set the other parameters as in the first experiment. The results are shown in Fig. 23. We see that the optimized as well as the straightforward strategies achieve the best efficiency when the number of threads is set in the range 10–20 for the CarDB dataset. However, we see an exception in the UN dataset, where the optimized strategy PMROND stabilizes when the number of threads is set in the range 10–20 and then offers a further improvement in efficiency when the number of threads is set to 40 as shown in Fig. 23a. We leave the optimization of these parameters as a future research direction.
7 Related work
7.1 Customer–product relationship
There are a number of works that study the computational aspects of customer–product relationships. Li et al. [11] propose a data cube framework called DADA to analyze dominance relationships from a microeconomic perspective. The framework is aimed to provide insights about the dominant relationships between the products and their potential buyers and supports three types of dominant relationship queries, e.g., (a) Linear Optimization Queries, (b) Subspace Analysis Queries, and (c) Comparative Dominant Queries. Given a set of products P and a set of customers C, Wu et al. [29] propose an improved algorithm for computing the influence set of a query product q. The influence of the query product q is measured as the cardinality of the reverse skyline of q termed as influence set for q, i.e., \(IS(q)=|RSL(q)|\). In [1], Arvanitis et al. propose an approach for computing k-Most Attractive Candidates (k-MAC). Given a set of candidate query products Q and an integer \(k>1\) (as well as P and C as in [29]), k-MAC query discovers the k-most attractive candidate set \(Q^\prime \subseteq Q\) such that \(|Q^\prime |=k\) and the joint influence score of \(Q^\prime \), defined as \(IS(Q^\prime )=|\bigcup _{q\in Q^\prime }{RSL(q)}|\), is maximized. In these two works, every customer c appearing in the RSL(p) is assumed to contribute 100 % for the sustainment of the product p in the market.
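The two influence measures mentioned above can be made concrete with a small Python sketch. It assumes that the reverse skylines have already been computed and are available as sets of customer identifiers; the function name `joint_influence` and the sample data are hypothetical.

```python
def joint_influence(rsl, chosen):
    """IS(Q') = |union of RSL(q) for q in Q'|, with rsl mapping q to its reverse skyline."""
    covered = set()
    for q in chosen:
        covered |= set(rsl[q])
    return len(covered)


rsl = {"q1": {"c1", "c2"}, "q2": {"c2", "c3"}, "q3": set()}
single = joint_influence(rsl, ["q1"])          # IS(q1) = |RSL(q1)|
joint = joint_influence(rsl, ["q1", "q2"])     # c2 is counted only once
```

A customer that appears in several reverse skylines is counted once in the joint score, which is why the joint influence of a set can be smaller than the sum of the individual influence scores.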
In [13], Lin et al. propose an approach for selecting k products from a set of candidate products such that the expected number of the total customers is maximized, known as k-most demanding products (k-MDP) discovering. The authors propose an exact algorithm for k-MDP by estimating the upper bound of the expected number of the total customers. They also offer a greedy algorithm which is scalable w.r.t. k. In [9], Koh et al. present an approach of computing k-most favorite products based on reverse top-t queries, which is NP-hard. They design an incremental greedy approach to find an approximate solution with guaranteed quality, exploiting the properties of the top-t queries and skyline queries. In [30], Xu et al. propose a product adoption model and a greedy-based approximation algorithm for selecting k products that can maximize the sales for the manufacturer. In [28], Wu et al. propose approaches for discovering the promotive subspaces in which the product objects become prominent.
In [27] Wu et al. propose efficient approaches for processing region-based promotion queries that can discover the top-k-most interesting regions for effective promotion of a product object in which it is top-ranked. In [14, 15], Miah et al. propose approaches for finding the best set of attributes of a new product so that it can stand out in the crowd of existing competitive products. Given a set of existing products P and a set of given products Q, Wan et al. [25, 26] and Peng et al. [20] propose approaches for finding a set of k best possible products from Q such that this subset of products of Q are not dominated by the products in P. They describe several variants of this problem and provide solutions for them. In [8] Islam et al. propose an approach of establishing an automatic negotiation between a customer and a product by modifying some of the product’s attributes (e.g., price) and customer preferences for those attributes with minimum penalty.
In our work, we show that a customer does not contribute 100 % for the sustainment of a product in the market, rather only a fraction of it. Considering this, we propose a novel probability-based product adoption model for the customer and a product selection strategy for the manufacturer with maximum expected number of attracted customers based on dynamic and reverse skylines. We also propose a new type of query called finding k-MPP and a solution approach for it.
7.2 Computing skylines in parallel
There are a number of works on parallelizing the processing of skyline queries. However, none of them are suitable for parallelizing the execution of k-MPP query. In [24], Vlachou et al. propose an angle-based space partitioning for skyline query processing in parallel using the hyperspherical coordinates of the data points. In [32], Zhang et al. present an object-based space partitioning approach for processing skyline queries in parallel. In [10], Kohler et al. propose a hyperplane data projection technique for computing skyline in parallel which is independent of data distribution. In [31], Zhang et al. propose MR-BNL algorithm by partitioning the data space into \(2^d\) subspaces according to the median of each dimension and then, computing the local skyline of every subspace in parallel. Finally, the global skyline is computed in a single machine from all the local skylines. In [17], Mullesgaard et al. propose grid-based data partitioning scheme for computing skyline in MapReduce. In [21], Pertesis and Doulkeridis propose an approach of processing skyline query in SpatialHadoop. However, these approaches are not suitable for dynamic/reverse skyline query processing and thereby, cannot solve our problem.
The work in [19] parallelizes a single dynamic/reverse skyline query using quad-tree index, not multiple reverse skylines. The quad-tree index is also query dependent and does not allow multiple queries to be grouped together. Therefore, this approach is not suitable for processing k-MPP query in parallel. In our work, we propose an approach for computing k-MPP query in parallel by designing a simple yet very efficient query-independent grid-based index structure. We also provide an approach for grouping queries based on their similarities and processing them together to parallelize the processing of k-MPP.
8 Conclusion
This paper presents an efficient approach for processing k-MPP product selection query and its variants. We design a simple yet very efficient query-independent grid-based index structure to partition the data space. We also establish the partitionwise dominances and several theoretical properties to reduce the search spaces of the basic computational units and filter the non-resultant data objects for computing k-MPP query in parallel. We also show how to improve the efficiency further by grouping queries based on their extended search spaces and processing them together. The effectiveness and efficiency of the proposed product selection model are demonstrated by conducting extensive experiments. Our model adds another level of assurance to identify the valued customers in the influence-based market research queries, which can be studied further.
We find only 67 query products are non-dominating w.r.t. attracting customers in the market, i.e., \(RSL(q)\ne \emptyset \).
Acknowledgments
The work is supported by the ARC Discovery Grant DP140103499. We would like to thank Dr. Robin Humble for his help on optimizing the performance of our programs in Swinburne HPC system and the anonymous reviewers for their insightful feedback.