Hybrid Semantic Recommender System for Chemical Compounds
Recommending Chemical Compounds of interest to a particular researcher is a poorly explored field. The few existing datasets with information about the preferences of researchers use implicit feedback. The lack of Recommender Systems in this field presents a challenge for the development of new recommendation models. In this work, we propose a Hybrid recommender model for recommending Chemical Compounds. The model integrates collaborative-filtering algorithms for implicit feedback (Alternating Least Squares (ALS) and Bayesian Personalized Ranking (BPR)) and semantic similarity between the Chemical Compounds in the ChEBI ontology (ONTO). We evaluated the model on an implicit feedback dataset of Chemical Compounds, CheRM. The Hybrid model improved the results of state-of-the-art collaborative-filtering algorithms, especially for Mean Reciprocal Rank, with an increase of 6.7% when comparing the collaborative-filtering ALS with the Hybrid ALS_ONTO.
Keywords: Recommender System, Implicit feedback, Ontology, Collaborative-Filtering, Semantic similarity
1 Introduction
The recommendation of Chemical Compounds of interest for scientific researchers has not been widely explored [9, 23]. However, Recommender Systems (RSs) may help in the discovery of compounds, for example, by suggesting items not yet studied by the researchers. One challenge in this field is the lack of available datasets with the preferences of the researchers about the Chemical Compounds for testing the RS. More recently, alternatives have emerged with the development of datasets consisting of data collected from implicit feedback. Unlike other datasets, for example, MovieLens, these datasets do not contain the specific interests of the researchers. Instead, this information is extracted from the activities of the researchers, for example, through scientific literature [3, 15].
Datasets of explicit and implicit feedback require different recommender algorithms, especially because implicit feedback has some significant drawbacks, such as the lack of negative feedback and an unbalanced ratio of positive vs. unobserved ratings [11, 18]. When dealing with implicit feedback datasets, the solution involves applying learning to rank (LtR) approaches. Given a set of items, LtR consists of identifying the order in which they should be recommended.
The main approaches in RSs are Collaborative-Filtering (CF) and Content-Based (CB). CF uses the similarity between the ratings of the users, and CB uses the similarity between the features of the items. CF approaches cannot deal with new items or new users in the system, i.e., items and users without ratings (cold start problem). CB does not have this problem for new items, and that is the main reason Hybrid RSs (CF + CB) exist. Among the tools used by CB are ontologies, which are related vocabularies of terms and definitions for a specific field of study [2, 28]. Some examples of well-known ontologies are the Chemical Entities of Biological Interest (ChEBI), the Gene Ontology (GO), and the Disease Ontology (DO).
In this paper, we propose a Hybrid recommender model for recommending Chemical Compounds, consisting of a CF module and a CB module. In the CF module we tested two algorithms for implicit feedback datasets, Alternating Least Squares (ALS) and Bayesian Personalized Ranking (BPR), separately. In the CB module we explored the semantic similarity between the compounds in the ChEBI ontology (ONTO algorithm). The Hybrid model combines ALS + ONTO, and BPR + ONTO. The framework developed for this work is available at https://github.com/lasigeBioTM/ChemRecSys.
2 Related Work
There are few studies using RSs for recommending Chemical Compounds. One describes the use of CF methods for creating a Free-Wilson-like fragment recommender system. Another uses RS techniques for the discovery of new inorganic compounds, applying machine learning to find the similarity between proposed and existing compounds.
Next, we describe studies using ontologies for improving the performance of CF algorithms. One study created an RS for recommending English collections of books in a library. The authors developed PORE, a personal ontology Recommender System, which builds a personal ontology for each user and then applies a CF method. They used a standard normalized cosine similarity for finding the similarity between the users. A second study also used an ontology for creating users’ profiles for the domain of books. The authors calculated the similarity not between the ratings of the users, but based on interest scores derived from the ontology. The CF method used was k-nearest neighbours. A third study developed a Trust–Semantic Fusion approach, tested on movies and Yahoo! datasets. This approach incorporates semantic knowledge from ontologies into the items’ primary information. It used the user-based Constrained Pearson Correlation and the user-based Jaccard similarity.
Another work presented a solution for top@k recommendations specifically for implicit feedback data. The authors developed SPrank (semantic path-based ranking). They extracted path-based features of the items from DBpedia and used LtR algorithms to rank the most relevant items. They tested the method on the music and movies domains. A further study developed a new semantic similarity measure, the Inferential Ontology-based Semantic Similarity. The new measure improved the results of a user-based CF approach, using Pearson Correlation for calculating the similarity between the users. The authors tested the approach on the tourism domain. Most recently, a Hybrid RS was developed and tested on the movies domain. The method used Singular Value Decomposition for dimensionality reduction in the item-based and user-based CF, and ontologies for item-based semantic similarity, improving the CF results. It does not deal with implicit data.
To the best of our knowledge, our study is the first to use semantic similarity for recommending Chemical Compounds, dealing with implicit data by using state-of-the-art methods (ALS and BPR) and improving the results for the top@k in several evaluation metrics.
3 The Proposed Model
For the CF module, we selected state-of-the-art CF recommender algorithms for implicit data, ALS and BPR. ALS is a latent factor algorithm that takes into account the confidence of each user-item pair rating. BPR is also a latent factor algorithm, but it is more appropriate for ranking a list of items. BPR does not simply treat the unobserved user-item pairs as zeros; instead, it takes into consideration the preference of a user between an observed and an unobserved rating.
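The difference between the two objectives described above can be illustrated with a minimal NumPy sketch. It is not the implementation used in this work: the function names are hypothetical, the confidence form c = 1 + alpha * r and the value alpha = 40 follow the usual presentation of ALS for implicit feedback and are assumptions here, and regularization terms are omitted.

```python
import numpy as np

def als_preference_confidence(ratings, alpha=40.0):
    """Split implicit counts into a binary preference and a confidence weight.

    alpha is an illustrative value (assumption), not a setting from this work.
    """
    ratings = np.asarray(ratings, dtype=float)
    preference = (ratings > 0).astype(float)   # 1 if the pair was observed
    confidence = 1.0 + alpha * ratings         # more articles, more confidence
    return preference, confidence

def bpr_pair_loss(user_vec, observed_item_vec, unobserved_item_vec):
    """Pairwise BPR loss (without regularization) for one user and one item pair.

    The loss is -log(sigmoid(gap)), where gap is the difference between the
    predicted scores of the observed and the unobserved item.
    """
    gap = user_vec @ observed_item_vec - user_vec @ unobserved_item_vec
    return float(np.log1p(np.exp(-gap)))

# Toy data: rows are authors, columns are compounds, values are article counts.
counts = np.array([[3, 0, 1],
                   [0, 2, 0]])
preference, confidence = als_preference_confidence(counts)

user = np.array([1.0, 0.0])
loss_good_order = bpr_pair_loss(user, np.array([2.0, 0.0]), np.array([0.0, 0.0]))
loss_no_gap = bpr_pair_loss(user, np.array([1.0, 0.0]), np.array([1.0, 0.0]))
```

In this sketch ALS weights every cell of the matrix, including the zeros, whereas the BPR loss only shrinks when an observed item is scored above an unobserved one.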
Whereas the CF module uses all the ratings from the train set to train the model, the CB module only takes into account the ratings of each user. Using DiShIn, we calculate the similarity between each item in the train set and each item in the test set.
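A minimal sketch of how such a content-based score could be computed is given below. Several parts are assumptions made only for illustration: the similarity values are made up and stored in a hypothetical lookup table rather than computed with DiShIn, the compound identifiers are placeholders, and the per-user aggregation (the mean similarity to the items the user rated in the train set) is one plausible choice, not necessarily the one used in this work.

```python
from statistics import mean

# Hypothetical, made-up pairwise semantic similarities between compounds.
TOY_SIMILARITY = {
    frozenset(("compound_a", "compound_c")): 0.8,
    frozenset(("compound_b", "compound_c")): 0.4,
    frozenset(("compound_a", "compound_d")): 0.1,
    frozenset(("compound_b", "compound_d")): 0.3,
}

def toy_similarity(item_1, item_2):
    """Look up a symmetric similarity; identical items have similarity 1."""
    if item_1 == item_2:
        return 1.0
    return TOY_SIMILARITY.get(frozenset((item_1, item_2)), 0.0)

def onto_scores(user_train_items, candidate_items, similarity):
    """Score each candidate by its mean similarity to the user's train items."""
    scores = {}
    for candidate in candidate_items:
        sims = [similarity(train_item, candidate) for train_item in user_train_items]
        scores[candidate] = mean(sims) if sims else 0.0
    return scores

scores = onto_scores(
    user_train_items=["compound_a", "compound_b"],
    candidate_items=["compound_c", "compound_d"],
    similarity=toy_similarity,
)
ranking = sorted(scores, key=scores.get, reverse=True)
```

Because the score depends only on the items a user has rated and on the ontology, a candidate compound can be scored even if no author in the train set has written about it.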
4 Experiments and Results
Experiments. The data used in this work is a subset of a dataset of Chemical Compounds, CheRM, with the format <user,item,rating>. The users are authors of research articles, the items are Chemical Compounds present in ChEBI, and the ratings (implicit) are the number of articles the author wrote about the item. The subset has 102 Chemical Compounds, 1184 authors, 5401 ratings, and a sparsity level of 95.5%. We used a subset of CheRM because the full dataset has more than 22,000 items and calculating the similarity between all the items in real time is a bottleneck.
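The reported sparsity level can be checked from the other three figures, assuming the usual definition of sparsity as the fraction of empty cells in the user-item matrix (the function name is only illustrative):

```python
def sparsity(n_users, n_items, n_ratings):
    """Fraction of user-item pairs that have no rating."""
    return 1.0 - n_ratings / (n_users * n_items)

subset_sparsity = sparsity(n_users=1184, n_items=102, n_ratings=5401)
print(f"{subset_sparsity:.1%}")  # prints 95.5%
```

With 1184 × 102 = 120,768 possible author-compound pairs and 5401 observed ones, about 4.5% of the matrix is filled.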
Results. We present the results of this study in Fig. 3, for all the algorithms and all the metrics described previously. As Fig. 3 shows, the ONTO algorithm alone has the lowest results in all metrics. Nevertheless, in metrics such as Precision, Recall and F-measure, it follows the trend of the other algorithms, and when measuring these metrics for the top@20, the results are similar. ONTO has the advantage of being a CB algorithm, so it does not suffer from the cold start problem for new items. ALS and BPR cannot be used if an item in the test set does not appear in the train set at least once (i.e., at least one author in the train set wrote about this Chemical Compound).
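For reference, Precision, Recall and F-measure at top@k can be sketched as follows. These are the standard textbook definitions, with precision divided by k, and may differ in detail from the exact variants behind Fig. 3; the function name and the toy items are hypothetical.

```python
def precision_recall_f1_at_k(ranked_items, relevant_items, k):
    """Precision, recall and F-measure computed on the first k recommendations."""
    relevant = set(relevant_items)
    hits = sum(1 for item in ranked_items[:k] if item in relevant)
    precision = hits / k if k else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    if hits == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Toy example: one of the three relevant compounds appears in the top 3.
ranked = ["c1", "c2", "c3", "c4", "c5"]
relevant = {"c2", "c5", "c9"}
p_at_3, r_at_3, f_at_3 = precision_recall_f1_at_k(ranked, relevant, k=3)
```

These metrics only count how many relevant items reach the top@k; they ignore the positions of those items inside the list.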
Between ALS and BPR, ALS achieved the better results. Since BPR is an algorithm for ranking, it was expected to obtain better results than ALS. We believe this is due to the fact that the dataset has a large number of ratings equal to one, so many items have the same relevance and are difficult to rank.
The approach with the best results in most of the metrics is the Hybrid ALS_ONTO. Using the ALS and ONTO algorithms together has a particularly positive effect on the metrics measuring ranking accuracy (MRR, nDCG and AUC), especially MRR, with an increase of 6.7% when comparing the ALS algorithm with the Hybrid ALS_ONTO. This means that ONTO reorders the ALS scores in such a way that the first results in the top@k are more relevant.
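The effect of such a reordering on MRR can be illustrated with a toy sketch. The combination rule used here, the product of the CF score and the semantic score, is an assumption made only for illustration, since this text does not state how the final Hybrid score is computed; all names and score values are made up.

```python
def reciprocal_rank(ranked_items, relevant_items):
    """1 / position of the first relevant item, or 0 if none is found."""
    relevant = set(relevant_items)
    for position, item in enumerate(ranked_items, start=1):
        if item in relevant:
            return 1.0 / position
    return 0.0

def mean_reciprocal_rank(rankings, relevant_sets):
    """Average reciprocal rank over several users."""
    pairs = list(zip(rankings, relevant_sets))
    return sum(reciprocal_rank(r, rel) for r, rel in pairs) / len(pairs)

def hybrid_ranking(cf_scores, semantic_scores):
    """Rank items by the product of both scores (assumed combination rule)."""
    combined = {
        item: cf_scores[item] * semantic_scores.get(item, 0.0)
        for item in cf_scores
    }
    return sorted(combined, key=combined.get, reverse=True)

# Made-up scores for one user; only compound "y" is relevant in the test set.
cf_scores = {"x": 0.9, "y": 0.8, "z": 0.1}
semantic_scores = {"x": 0.2, "y": 0.7, "z": 0.5}
relevant = {"y"}

cf_ranking = sorted(cf_scores, key=cf_scores.get, reverse=True)
hybrid = hybrid_ranking(cf_scores, semantic_scores)

rr_cf = reciprocal_rank(cf_ranking, relevant)    # "y" is second in the CF list
rr_hybrid = reciprocal_rank(hybrid, relevant)    # "y" moves to the first place
```

In this toy case the set of recommended items is unchanged, so Precision and Recall stay the same, but the relevant item moves from the second to the first position and the reciprocal rank rises from 0.5 to 1.0.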
These are preliminary results. The study needs to be replicated with the full CheRM dataset, and more studies are needed to assess the real impact on the cold start problem. Nevertheless, the results seem promising, on the one hand for improving the relevance of the recommendations provided (classification accuracy metrics, CAMet), and on the other hand for improving the position of the most relevant items in a ranked list (rank accuracy metrics, RAMet). Our Hybrid algorithm may be applied to other areas, for example, to genes, phenotypes, and diseases, provided that an ontology exists for these items.
5 Conclusion
In this work, we presented a Hybrid recommendation model for recommending Chemical Compounds, based on CF algorithms for implicit data and a CB algorithm based on the semantic similarity of the Chemical Compounds in the ChEBI ontology. The results support our hypothesis that using the semantic similarity between the Chemical Compounds can improve the results of state-of-the-art CF algorithms. For future work, we intend to increase the size of the dataset, to test other similarity metrics, and to test other alternatives for calculating the final score of the Hybrid algorithm.
References
- 4. Consortium, G.O.: The gene ontology resource: 20 years and still going strong. Nucleic Acids Res. 47(D1), D330–D338 (2018)
- 5. Couto, F., Lamurias, A.: Semantic similarity definition. In: Encyclopedia of Bioinformatics and Computational Biology, vol. 1 (2019)
- 6. Harper, F.M., Konstan, J.A.: The MovieLens datasets: history and context. ACM Trans. Interact. Intell. Syst. (TIIS) 5(4), 1–19 (2015)
- 8. Hu, Y., Koren, Y., Volinsky, C.: Collaborative filtering for implicit feedback datasets. In: 2008 Eighth IEEE International Conference on Data Mining, pp. 263–272. IEEE (2008)
- 10. Jiang, J.J., Conrath, D.W.: Semantic similarity based on corpus statistics and lexical taxonomy. arXiv preprint cmp-lg/9709008 (1997)
- 13. Lin, D., et al.: An information-theoretic definition of similarity. In: ICML, vol. 98, pp. 296–304. Citeseer (1998)
- 16. Ostuni, V.C., Di Noia, T., Di Sciascio, E., Mirizzi, R.: Top-N recommendations from implicit feedback leveraging linked open data. In: Proceedings of the 7th ACM Conference on Recommender Systems, pp. 85–92. ACM (2013)
- 17. Rendle, S., Balby Marinho, L., Nanopoulos, A., Schmidt-Thieme, L.: Learning optimal ranking with tensor factorization for tag recommendation. In: Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 727–736. ACM (2009)
- 18. Rendle, S., Freudenthaler, C., Gantner, Z., Schmidt-Thieme, L.: BPR: Bayesian personalized ranking from implicit feedback. In: Proceedings of the Twenty-Fifth Conference on Uncertainty in Artificial Intelligence, pp. 452–461. AUAI Press (2009)
- 19. Resnik, P.: Using information content to evaluate semantic similarity in a taxonomy. arXiv preprint cmp-lg/9511007 (1995)
- 22. Schröder, G., Thiele, M., Lehner, W.: Setting goals and choosing metrics for recommender system evaluations. In: UCERSTI2 Workshop at the 5th ACM Conference on Recommender Systems, Chicago, USA, vol. 23, p. 53 (2011)
- 26. Sieg, A., Mobasher, B., Burke, R.: Improving the effectiveness of collaborative recommendation with ontology-based user profiles. In: Proceedings of the 1st International Workshop on Information Heterogeneity and Fusion in Recommender Systems, pp. 39–46. ACM (2010)