1 Introduction

The goal of this work is to create a 5-star (footnote 1) open data dataset about Russian food products and their ingredients. This work involves (a) developing a food ontology, (b) crawling existing sources, (c) publishing the information as Linked Data, and (d) linking it to existing LOD datasets such as AGROVOC [1] and DBpedia [2].

New applications and services can be built on a dataset created with Semantic Web technologies. For example, manufacturers can use it to standardise ingredient names, retailers can reuse the information in their e-shops, and developers can build applications that help customers decide which product to buy based on their health conditions or personal preferences.

2 Dataset Creation

The source of information for FOODpedia is a web site called GoodsMatrix (footnote 2), a manually curated catalogue whose information comes mainly from manufacturers.

Food product information is extracted from GoodsMatrix through a pipeline that includes (a) crawling the web site using the Scrapy (footnote 3) framework and a set of XPath expressions, (b) parsing the resulting data to extract information about energy values, ingredients and E-additives, (c) translating the name and description into English, and (d) linking the ingredient information to resources in the AGROVOC and DBpedia datasets.

The source code of the crawler and other artifacts are available in a GitHub repository (footnote 4).

Extraction of Ingredients. Ingredients are crawled as a list separated by a character such as a comma or semicolon. An unsolved issue remains: different manufacturers rarely use the same name for the same ingredient, and some ingredients have more than a dozen alternative names. Such names usually differ only in word order or in missing or extra words. We therefore apply the Ratcliff-Obershelp algorithm [3] to measure string similarity and create a single resource for similar names.
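The grouping of similar names can be sketched as follows. This is a minimal illustration, not the crawler's actual code: it uses `difflib.SequenceMatcher` from the Python standard library, whose matching is a close relative of Ratcliff-Obershelp gestalt pattern matching. The word-sorting step, the greedy grouping strategy, the 0.85 threshold and the sample names are all assumptions made for the example.

```python
from difflib import SequenceMatcher


def normalise(name):
    """Lowercase the name and sort its words so that word order is ignored (assumed step)."""
    return " ".join(sorted(name.lower().split()))


def similarity(a, b):
    """Similarity ratio in [0, 1] computed by difflib on the normalised names."""
    return SequenceMatcher(None, normalise(a), normalise(b)).ratio()


def group_names(names, threshold=0.85):
    """Greedily put each name into the first group whose representative is similar enough."""
    groups = []  # each group is a list; its first element acts as the representative
    for name in names:
        for group in groups:
            if similarity(name, group[0]) >= threshold:
                group.append(name)
                break
        else:
            # No existing group was close enough, so this name starts a new one.
            groups.append([name])
    return groups


# Hypothetical ingredient names; two of them differ only in word order.
names = ["сахар", "сахар-песок", "мука пшеничная", "пшеничная мука", "соль"]
groups = group_names(names)
```

Each resulting group would then correspond to one ingredient resource. The threshold trades precision against recall: a lower value merges more spelling variants but also risks merging distinct ingredients.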

Extraction of E-additives. E-additives are food additives that have special identifiers called E numbers, such as E-100 or E-201, and are used in Europe, Russia and other countries. Since the identifiers have a well-defined structure, they are easy to find in the ingredient list using regular expressions. The only issue is additives that have an E number but are listed on the package by name only, e.g. Curcumin (footnote 5).
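A regular expression for E numbers might look like the sketch below. The exact pattern used by the crawler is not given in the text, so this one is an assumption: it accepts a Latin "E" or a Cyrillic "Е", an optional hyphen or space, three or four digits, and an optional lowercase suffix letter (as in E150a). The function name `find_e_numbers` and the canonical output form are illustrative.

```python
import re

# Latin "E" or Cyrillic "Е" (U+0415), optional hyphen or space, 3-4 digits,
# optional lowercase suffix letter. The lookarounds prevent matches inside words.
E_NUMBER = re.compile(r"(?<!\w)[E\u0415][-\s]?(\d{3,4})([a-z]?)(?!\w)")


def find_e_numbers(ingredients):
    """Return canonical identifiers such as 'E100' in order of first appearance."""
    found = []
    for digits, suffix in E_NUMBER.findall(ingredients):
        code = "E" + digits + suffix
        if code not in found:  # skip repeated mentions of the same additive
            found.append(code)
    return found


# Hypothetical ingredient list mixing Cyrillic and Latin spellings.
sample = "сахар, краситель \u0415-100, регулятор кислотности E330, E150a, E-100"
codes = find_e_numbers(sample)
```

A pattern like this cannot detect additives that appear by name only, such as curcumin, which is the open issue noted above.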

Multilingual Support. The name and description of each crawled food product are translated into English with the help of the Yandex.Translate API (footnote 6).

Linking. Extracted E-additives and ingredients are linked to similar resources in the AGROVOC and DBpedia datasets.

AGROVOC is a multilingual agricultural thesaurus consisting of over 32 000 concepts available in 21 languages, including Russian, which makes it a good candidate for linking. Ingredients are mapped to AGROVOC concepts automatically. AGROVOC does not cover E numbers, so E-additives are mapped manually.
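The text does not say how the automatic mapping is done. One plausible approach, sketched here purely as an assumption, is exact matching on normalised labels, with unmatched names left for manual review. The concept URIs below are placeholders under `example.org`, not real AGROVOC identifiers, and all function names are hypothetical.

```python
import re


def normalise_label(label):
    """Casefold, treat 'ё' as 'е', and collapse runs of whitespace."""
    label = label.casefold().replace("ё", "е")
    return re.sub(r"\s+", " ", label).strip()


def build_index(concepts):
    """Map every normalised label to the URI of its concept."""
    return {
        normalise_label(label): uri
        for uri, labels in concepts.items()
        for label in labels
    }


def link_ingredients(ingredients, index):
    """Return (name, uri) links plus the names that found no match."""
    links, unmatched = [], []
    for name in ingredients:
        uri = index.get(normalise_label(name))
        if uri is None:
            unmatched.append(name)  # left for manual mapping
        else:
            links.append((name, uri))
    return links, unmatched


# Placeholder concepts; real AGROVOC URIs and labels would be loaded from the thesaurus.
concepts = {
    "http://example.org/agrovoc/c_sugar": ["сахар", "sugar"],
    "http://example.org/agrovoc/c_salt": ["соль", "salt"],
}
index = build_index(concepts)
links, unmatched = link_ingredients(["Сахар", "  соль ", "E330"], index)
```

In this sketch the E number ends up in `unmatched`, mirroring the situation described above where E-additives have to be mapped by hand.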

DBpedia is a good source of human-readable descriptions of concepts, so it is worthwhile to link E-additives and ingredients to its resources. This is not straightforward, however, because the DBpedia ontology is generated semi-automatically. The mapping is therefore performed manually.

3 Ontologies

To represent food products and their ingredients, the Food Product Ontology (footnote 7) was developed; it extends GoodRelations (footnote 8) and the Food Ontology (footnote 9). Below is an example of a food product in Turtle:

figure a

An example of an ingredient with links to similar resources in the AGROVOC and DBpedia datasets:

figure b

4 Publishing

The dataset is published using Pubby (footnote 10). The interface for human and machine consumption is available at http://foodpedia.tk. Using the SPARQL endpoint (footnote 11) provided by the underlying Virtuoso triple store (footnote 12), actors can satisfy complex information needs. In addition, actors can use another query interface through a Linked Data Fragments [4] server (footnote 13) for high-availability querying. Finally, users can use a simple search interface (see Fig. 1) to find food products by barcode or name.
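As an illustration of querying the SPARQL endpoint, the sketch below builds a barcode lookup request following the standard SPARQL protocol (`query` parameter on a GET request). The endpoint URL is a placeholder, since the real address is only given in the footnote. The use of `gr:hasEAN_UCC-13` and `gr:name` from GoodRelations is an assumption about how the dataset models barcodes and names, and may not match the published data.

```python
from urllib.parse import urlencode

# Placeholder endpoint; substitute the address of the actual SPARQL endpoint.
ENDPOINT = "http://example.org/sparql"

# Assumed modelling: products carry a GoodRelations EAN-13 barcode and a name.
QUERY_TEMPLATE = """\
PREFIX gr: <http://purl.org/goodrelations/v1#>
SELECT ?product ?name WHERE {{
  ?product gr:hasEAN_UCC-13 "{barcode}" ;
           gr:name ?name .
}} LIMIT 10
"""


def build_query(barcode):
    """Return a SPARQL query for one barcode, rejecting anything but 13 ASCII digits."""
    if not (barcode.isascii() and barcode.isdigit() and len(barcode) == 13):
        raise ValueError("expected a 13-digit EAN barcode")
    return QUERY_TEMPLATE.format(barcode=barcode)


def build_request_url(barcode, endpoint=ENDPOINT):
    """Return the GET URL that carries the query in the 'query' parameter."""
    return endpoint + "?" + urlencode({"query": build_query(barcode)})


# Hypothetical barcode used only to show the shape of the request.
url = build_request_url("4600000000000")
```

Validating the barcode before formatting keeps arbitrary text out of the query string. The resulting URL can be fetched with any HTTP client, with an `Accept` header selecting the result format.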

Fig. 1. FOODpedia search interface

Licensing. All published data is openly licensed under the Creative Commons Attribution License, in accordance with the Open Definition (footnote 14).