In Sect. 4 we described the general flow of the approach. In this section we present how we implemented the different steps at Rakuten. The architecture, illustrated in Fig. 4, can be divided into four layers: (i) the input layer, i.e., artifacts such as the Rakuten taxonomy, the search logs, etc.; (ii) the computational framework on which we run the different modules of the system; (iii) the processing layer, in charge of generating the intermediate artifacts from the input: sets of subtrees, tokenized item descriptions, word2vec models, etc.; (iv) the extension layer, in charge of extending the taxonomy using the intermediate artifacts and exposing the output as RDF. We first briefly describe the input layer and the computational framework, and then take a closer look at the processing and extension layers.
Input Layer: The system takes the following artifacts as input:
- Rakuten Taxonomy: The Rakuten taxonomy contains 38,167 classes: 35 with depth 1, 405 with depth 2, 3,790 with depth 3, 15,849 with depth 4, and the remaining classes with depth 5. Each class has exactly one parent, except for the root. Unfortunately, the taxonomy is not formalized in a well-known ontology language. The intended semantics of the parent-child relation is that of rdfs:subClassOf.
- Search and Browsing Logs: This dataset (2TB per year) contains information about all the search queries executed by users of Rakuten Ichiba: keywords, the class being browsed, items browsed after the query, etc.
- Item Descriptions: This dataset (800GB per year) contains the HTML pages describing each item sold on Rakuten Ichiba, together with the class to which they belong and other metadata irrelevant to this project.
Computing Framework: Given the size of the datasets, the system relies on distributed computing frameworks to perform computations at the level of the full taxonomy, i.e., computations that require considering all the classes. A Hive cluster with 1000 nodes is used to pre-process and analyze the query logs and then score the search keywords. A Spark cluster with 200 nodes is used to select the subtrees for which the taxonomy needs to be extended. The processing of the individual subtrees, on the other hand, is independent, and bounded in memory and computational requirements for each instance. Therefore the subtree-level pipeline, which consists of a series of Python scripts, is run on a standard cluster and does not require a particular computing framework to scale.
5.1 Processing Layer
Subtree Selector. This module selects the subtrees in the Rakuten taxonomy that will be extended. Next we give concrete definitions of the different abstract concepts listed in Sect. 4.
Need for Navigational Assistance: To measure the need for new discovery axes of a subtree, we compute its GMS diversity, defined as \(\exp (-\sum _{i} p_i\ln p_i)\) where the sum is over the items in the subtree and \(p_i\) is the proportion of the total GMS of the subtree which is due to the item i. This is the exponential of the Shannon entropy of the subtree’s item-level GMS which intuitively represents the effective number of items in a subtree making up its GMS. A subtree is said to have a high need for navigational assistance (NNA) if its effective number of items is more than \(Z_1 = 2^{15}\). It is said to have a low NNA, and is therefore discarded, if its effective number of items is less than \(Z_2 = 2^7\). These values for \(Z_1\) and \(Z_2\) are based on an initial exploration of the data and the final values will be decided by the catalog team.
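The GMS diversity above is straightforward to compute from the item-level GMS shares; a minimal sketch in Python (function and variable names are ours, not the production code):

```python
import math

def effective_item_count(gms_by_item):
    """Exponential of the Shannon entropy of the item-level GMS shares."""
    total = sum(gms_by_item)
    entropy = -sum((g / total) * math.log(g / total)
                   for g in gms_by_item if g > 0)
    return math.exp(entropy)

Z1, Z2 = 2**15, 2**7  # thresholds from the text

def nna_label(gms_by_item):
    """Classify a subtree's need for navigational assistance (NNA)."""
    n = effective_item_count(gms_by_item)
    if n > Z1:
        return "high"
    if n < Z2:
        return "low"  # subtree is discarded
    return "undecided"
```

A subtree where sales are spread evenly over N items has an effective count of exactly N; concentrating the GMS on a few items lowers it, matching the intuition of an "effective number of items".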
Business Impact: The business impact is not used in addition to the NNA requirement as we found in practice that subtrees not discarded by the previous requirement have high enough business impact. Indeed, a counter-example would necessitate a large number of items with very low and almost evenly distributed sales volumes, which is not found in our datasets.
Semantic Homogeneity: We use search query logs to measure how semantically homogeneous a given subtree \(t_1\) is. For each node with depth 1 in \(t_1\), we compute the set of search keywords leading to that node (class). A keyword is said to lead to a class if a user searching for this keyword clicked on an item of this class immediately after the query. Then the subtree is said to be homogeneous if the number of such keywords is larger than \(Z_3 = 30\), a value determined empirically.
In Fig. 5(A) we show the algorithm that we use to find the subtrees in the taxonomy. For the sake of clarity it uses a number of functions that we informally define next. Let T be a tree, C a class in R, M a model built as explained above, and S a set of classes. Then T(i) returns the set of classes in T with depth i; \(\texttt {children}(C,T)\) returns the children of C in T; \(\texttt {subtree}(C,T)\) returns the subtree of T “hanging” from C; \(\texttt {subtree}(C,T,S)\) returns the subtree \(T_1\) of T hanging from C but restricting the nodes with depth 1 in \(T_1\) to those in S; \(\texttt {nna}(T)\) returns the effective number of items in T; \(\texttt {hom}(S)\) returns the homogeneity of S.
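Since Fig. 5(A) is not reproduced here, the following sketch only approximates its selection loop using the helper functions just defined; the dict-based tree representation, the traversal order, and the exact interplay of nna and hom are our assumptions:

```python
def classes_at_depth(tree, root, depth):
    """Classes at a given depth below the root (depth 1 = children).
    `tree` maps each class to the list of its children (our encoding)."""
    level = [root]
    for _ in range(depth):
        level = [c for node in level for c in tree.get(node, [])]
    return level

def subtree_nodes(tree, c):
    """All classes of the subtree hanging from c, including c itself."""
    out, stack = [], [c]
    while stack:
        node = stack.pop()
        out.append(node)
        stack.extend(tree.get(node, []))
    return out

def select_subtrees(tree, root, nna, hom, max_depth=3,
                    Z1=2**15, Z2=2**7):
    """Approximate selection loop: keep roots of subtrees with a high
    need for navigational assistance that are semantically homogeneous.
    `nna` and `hom` are supplied as callables (stand-ins here)."""
    selected = []
    for d in range(1, max_depth + 1):
        for c in classes_at_depth(tree, root, d):
            t1 = subtree_nodes(tree, c)
            n = nna(t1)
            if n < Z2:              # low NNA: discard
                continue
            if n > Z1 and hom(t1):  # high NNA and homogeneous
                selected.append(c)
    return selected
```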
Property-Value Seed Extraction. We extract the initial set of properties and values (PV) from HTML tables and semi-structured text inputted by merchants in the item descriptions, as these are easy to parse, quite accurate, and reflect the domain knowledge of the merchants. There are several approaches in the literature for extracting information from HTML tables [6, 10, 23]; here we adopt the one used in [23] (Sect. 4.1.1) and use a slightly modified version of their implementation. Intuitively, the property names are first extracted from the headers of HTML tables in the item descriptions, and the associated values are found as the adjacent keywords either in the tables or in semi-structured text.
Model Training. The initial PV list obtained only from HTML tables and semi-structured text can lack a number of popular values, depending on the class. To increase the coverage of the property range we bootstrap the list using neural models of context similarity (word2vec). This module is in charge of training these models on the set of item descriptions within a given subtree.
We use word2vec’s CBOW model with negative sampling to find words which appear in similar contexts in item descriptions. The item descriptions are first stripped of non-text features such as HTML tags and URLs. Then they are tokenized using MeCab, a Japanese morphological analyzer of which only the tokenizing part is used. Two models are trained on the resulting data. The first one is trained directly on the tokenized descriptions seen as bags of words. The second one is trained on the tokenized descriptions after performing collocation using the popular search keywords extracted previously (by the Keyword Ranker). Collocation is done in two steps due to specificities of the Japanese language. The first step is to join adjacent tokens into popular “words”: since words are usually not separated by spaces or punctuation in Japanese, the tokenizer may cut a word into several tokens. The second step is to join the resulting words into popular ngrams (up to trigrams).
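The collocation step can be sketched as a greedy join of adjacent tokens against the popular-keyword list. The sketch below collapses the two passes described above into a single greedy pass, which is a simplification of ours, not the production code:

```python
def collocate(tokens, popular, max_n=3):
    """Greedily join adjacent tokens into popular words/ngrams.

    `popular` is the set of popular search keywords (from the Keyword
    Ranker). Japanese tokens are joined without separators, since the
    original text has no spaces between words.
    """
    out, i = [], 0
    while i < len(tokens):
        joined = None
        # Try the longest join first, up to max_n adjacent tokens.
        for n in range(min(max_n, len(tokens) - i), 1, -1):
            candidate = "".join(tokens[i:i + n])
            if candidate in popular:
                joined = candidate
                i += n
                break
        if joined is None:  # no popular join: keep the token as is
            joined = tokens[i]
            i += 1
        out.append(joined)
    return out
```

For instance, if the tokenizer splits 赤ワイン (“red wine”) into 赤 and ワイン, and 赤ワイン is a popular keyword, the pass re-joins it into a single token before training the second model.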
After training we obtain two models built from slightly different representations of the item descriptions. This module relies on the word2vec implementation of the gensim library.
5.2 Extension Layer
Seed Cleaning. The initial PV list extracted in the processing layer usually has a fairly high precision but is not usable as is. The first issue is the existence of redundant property names, meaning that merchants use different words to identify the same concept. Redundant property names can either be (i) different terms, such as (manufacturer) and (maker), or (ii) the same term written in different alphabets or combinations thereof, such as and (grape variety)Footnote 3. Another issue is the existence of values that are not useful as discovery axes, such as expiration dates, model numbers or long ingredient lists. It is important to point out that this information may be accurate but is not deemed relevant for the purposes of this project. This is why we do not aim to obtain a complete model of the domain, but to extract the core fragment that is relevant to the users.
Properties Aggregation: We first remove redundant property names. For this we develop the following score function. Let \(P_1\) and \(P_2\) be two properties in the seed, \(m_1\) and \(m_2\) their respective range sizes and n the size of the intersection of their ranges. Their similarity score function is defined as:
$$\begin{aligned} \textstyle L(P_1, P_2) = L_\mathrm {conf}\left( \frac{n}{\min (m_1, m_2)}\right) \times L_\mathrm {size}\left( \frac{\min (m_1, m_2)}{\max (m_1, m_2)}\right) - L_\mathrm {error}\left( \frac{1}{n}\right) \end{aligned}$$
where \(L_\mathrm {conf}\) is an increasing function representing the naive confidence that two properties are similar if they share many values relative to the maximum possible number of shared values, \(L_\mathrm {size}\) is a decreasing function which tempers that confidence if the properties have comparable range sizes, and \(L_\mathrm {error}\) is an increasing function, with value 0 at 0, that grows as the number of shared values decreases, modelling the uncertainty over the score computed from \(L_\mathrm {conf}\) and \(L_\mathrm {size}\). In practice we use \(L_\mathrm {conf}(x) = x\), \(L_\mathrm {size}(x) = \exp (-ax)\) and \(L_\mathrm {error}(x) = bx\), with the ad-hoc parameters \(a=0.33\) and \(b=0.1\). Two properties are considered similar if their similarity score is larger than 0.1, and the equivalence classes for this relation are computed. For each of these classes, we pick the representative that occurs most often in item descriptions as the final property name.
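With these concrete choices of \(L_\mathrm {conf}\), \(L_\mathrm {size}\) and \(L_\mathrm {error}\), the score function can be written as follows (a sketch; representing property ranges as Python collections of values is our choice):

```python
import math

def similarity_score(range1, range2, a=0.33, b=0.1):
    """L(P1, P2) from the text: naive confidence from shared values,
    tempered by the range-size ratio, minus an error term that grows
    as shared values become rare."""
    m1, m2 = len(range1), len(range2)
    n = len(set(range1) & set(range2))
    if n == 0:
        return float("-inf")  # no shared values: never similar
    conf = n / min(m1, m2)                            # L_conf(x) = x
    size = math.exp(-a * min(m1, m2) / max(m1, m2))   # L_size(x) = exp(-ax)
    error = b / n                                     # L_error(x) = bx
    return conf * size - error

def similar(range1, range2, threshold=0.1):
    """Two properties are merged when the score exceeds the threshold."""
    return similarity_score(range1, range2) > threshold
```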
Observe that [23, 26] also perform an aggregation step. Intuitively, in [23] two properties are aggregated into a vector if they have a popular value (among merchants) in common, and two vectors are aggregated if their cosine similarity is above a given threshold. We empirically observed that the score function presented here can more accurately single out synonym properties since it does not depend on a single value occurring multiple times, but on the set of shared values and on the range sizes of the properties. In [26] cosine similarity is used to aggregate property names; again, in our experiments (using word2vec) we obtained better results with the score function presented here. As in [23], two properties are not aggregated if both are found to appear in the same item description during the seed extraction.
Values Filtering: The next step is to clean the properties’ ranges by discarding any non-popular value, measured by their frequency in the search queries logs (obtained by the Keyword Ranker). The result of these two steps is a small list of property-values pairs with low redundancy and high precision which is representative of the interests of the users.
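The value-filtering step can be sketched as follows, assuming the Keyword Ranker output is available as a word-to-frequency map (the function name and the threshold value are illustrative):

```python
def filter_values(pv, keyword_freq, min_freq=30):
    """Discard non-popular values from each property's range, where
    popularity is the value's frequency in the search query logs."""
    return {prop: [v for v in values if keyword_freq.get(v, 0) >= min_freq]
            for prop, values in pv.items()}
```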
Bootstrapping. We then expand the coverage of the PV seed obtained so far. The bootstrapping algorithm, simplified for the sake of presentation, is shown in Fig. 5(B). We use the two models described in the previous subsection to mitigate spurious high similarities between words that are not semantically similar, caused by the text pre-processing and by tokenizing errors (particularly relevant for Japanese). Another use of the two models is to introduce a natural stopping condition for the bootstrapping algorithm. For each property, we only consider the 10 words most similar to the current range for each model, and then intersect the two outputs. This overcomes the problem of setting a meaningful threshold on the similarities provided by word2vec.
The algorithm iterates over the property list, P, adding new values to the range of each property, \(S_{P_i}\), until no more new values are found (newValues reaches a fixed point). For a new keyword x to be added to the property range \(S_{P_i}\) two conditions must hold: (i) all the models (two in this case) must agree that x is similar to \(S_{P_i}\) (lines 8–9); (ii) there should not be another property \(P_j\) such that x is more similar to \(S_{P_j}\) than to \(S_{P_i}\) (line 12). Observe that if x is added to \(S_{P_i}\) and also belongs to a less similar \(S_{P_j}\), then x is removed from \(S_{P_j}\) enforcing the disjointness of values in the property ranges.
Intuitively, the function most_similar finds the top n words in the vocabulary that are most similar to the words in the range of the property P. More specifically, it finds the top n vectors that maximize the multiplicative combination of the cosine similarities (originally proposed in [15]) between the given set of vectors, \(S_{P_i}\), and a candidate vector in the model vocabulary \(V \setminus S_{P_i}\):
$$\begin{aligned} score_{M_i}^{S_P}(candidate) = \prod _{v\in S_P} \cos (candidate, v) \end{aligned}$$
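The multiplicative scoring and the two-model intersection can be sketched in plain Python, using a word-to-vector map as a stand-in for a trained word2vec model (note that this literal product can go negative for anti-correlated vectors; the sketch does not shift cosines):

```python
import math

def cosine(u, v):
    """Cosine similarity between two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def most_similar(model, S_P, n=10):
    """Top-n vocabulary words maximising the multiplicative combination
    of cosine similarities with the current range S_P."""
    scores = {}
    for word, vec in model.items():
        if word in S_P:  # only candidates from V \ S_P
            continue
        score = 1.0
        for v in S_P:
            score *= cosine(vec, model[v])
        scores[word] = score
    return sorted(scores, key=scores.get, reverse=True)[:n]

def candidates(model_a, model_b, S_P, n=10):
    """Intersect the two models' top-n lists: the natural stopping
    condition described in the text."""
    return set(most_similar(model_a, S_P, n)) & set(most_similar(model_b, S_P, n))
```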
Triple Generator. This module takes all the item titles/descriptions in a given subtree \(t_1\), and the bootstrapped property-value list for \(t_1\). For every item I and every property P, it first looks for the values of P in the HTML tables and semi-structured text of the description of I; if none is found, it looks for the values in the title of I, and finally in the free text of the description of I. If two different values for P appear together in one of the three steps above, it ignores them and moves to the next step. Once it finds a value v for P in I, it generates the triple (I, P, v).
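The cascade above can be sketched as follows; representing an item as a dict of three text fields and matching values by substring search are our simplifications (the real system parses HTML tables and tokenized text):

```python
def find_value(item, values):
    """Cascade over the three text fields: tables/semi-structured text,
    then title, then free text. A step that yields two distinct values
    is considered ambiguous and skipped."""
    for field in ("tables", "title", "free_text"):
        found = [v for v in values if v in item.get(field, "")]
        if len(set(found)) == 1:
            return found[0]
        # zero matches or an ambiguous step: fall through to the next
    return None

def generate_triples(items, pv):
    """Emit one (item, property, value) triple per item and property
    for which the cascade finds a single value."""
    triples = []
    for item_id, item in items.items():
        for prop, values in pv.items():
            v = find_value(item, values)
            if v is not None:
                triples.append((item_id, prop, v))
    return triples
```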
Semantic Gate. The semantic gate is in charge of exposing the triples through a SPARQL endpoint, and of moving the new extended ontology into OWL 2 (DL). Recall that the existing taxonomy is not available in any well-known ontology language. For the SPARQL endpoint we use Sesame WorkbenchFootnote 4 and OntopFootnote 5. Ontop implements an ontology-based data access (OBDA) approach; interested readers can refer to [5, 20]. We chose OBDA because it is a non-invasive way to introduce semantic standards (RDF/OWL/SPARQL) to the different business units at Rakuten, while still allowing the different departments to access the data through standard SQL, in which they are already proficient.
5.3 Limitations
The current implementation has two major limitations that we will work on in the future. The first one is that it only handles words as property values. Thus, alphanumeric values such as 100 ml or 2 kg cannot be handled at the moment, and properties such as size are therefore discarded. Extending our approach to handle them does not present any technical challenge, but it requires time to implement in a way that is not detrimental to performance. The second limitation is that we only consider subtrees whose root has depth at most 3.