Abstract
Machine learning has recently enabled large advances in artificial intelligence, but these results can be highly centralized. The large datasets required are generally proprietary; predictions are often sold on a per-query basis; and published models can quickly become out of date without effort to acquire more data and maintain them. Published proposals to provide models and data for free for certain tasks include Microsoft Research's Decentralized and Collaborative AI on Blockchain. The framework allows participants to collaboratively build a dataset and use smart contracts to share a continuously updated model on a public blockchain. The initial proposal gave an overview of the framework, omitting many details of the models used and the incentive mechanisms in real-world scenarios. For example, the Self-Assessment incentive mechanism proposed in that work could have problems such as participants losing deposits and the model becoming inaccurate over time if the proper parameters are not set when the framework is configured. In this work, we evaluate the use of several models and configurations in order to propose best practices when using the Self-Assessment incentive mechanism, so that models can remain accurate and well-intended participants that submit correct data have the chance to profit. We analyzed simulations for each of three models: Perceptron, Naïve Bayes, and a Nearest Centroid Classifier, with three different datasets: predicting a sport with user activity from Endomondo, sentiment analysis on movie reviews from IMDB, and determining if a news article is fake. We compare several factors for each dataset when models are hosted in smart contracts on a public blockchain: their accuracy over time, the balances of a good and a bad user, and transaction costs (or gas) for deploying, updating, collecting refunds, and collecting rewards. A free and open source implementation of these models for the Ethereum blockchain is provided at https://github.com/microsoft/0xDeCA10B.
1 Introduction
The advancement of popular blockchain-based cryptocurrencies such as Bitcoin [1] and Ethereum [2] has inspired research in decentralized applications that leverage these publicly available resources. One application that can greatly benefit from decentralized public blockchains is the collaborative training of machine learning models, allowing users to improve a model in novel ways [3]. There exist several proposals to use blockchain frameworks to enable the sharing of machine learning models. In DInEMMo, access to trained models is brokered through a marketplace, allowing contributors to profit based on a model's usage, but this limits access to just those who can afford the price [4]. DanKu proposes a framework for competitions by storing already trained models in smart contracts, which does not allow for continual updating [5]. Proposals to change Proof-of-Work (PoW) to be more utilitarian by training machine learning models have also gained in popularity, such as A Proof of Useful Work for Artificial Intelligence on the Blockchain [6]. These approaches can still incite technical centralization, such as concentrating machine learning expertise, siloing proprietary data, and restricting access to machine learning model predictions (e.g. charging on a per-query basis). In the crowdsourcing space, a decentralized approach called CrowdBC has been proposed to use a blockchain to facilitate crowdsourcing [7].
To address centralization in machine learning, frameworks have been proposed to share machine learning models on a public blockchain while keeping the models free to use for inference. One example is Decentralized and Collaborative AI on Blockchain from Microsoft Research [3]. That work focuses on describing several possible incentive mechanisms to encourage participants to add data to train a model. The present paper is a continuation of the author's previous work in [3].
The system proposed in [3] is modular: different models or incentive mechanisms (IMs) can be used and seamlessly swapped; however, some IMs might work better for certain models and vice versa. We consider models that can be efficiently updated with one sample at a time, making them useful for deployment on Proof-of-Work (PoW) blockchains [3] such as the current public Ethereum [2] blockchain. The first is a Naive Bayes classifier, chosen for its applicability to many types of problems [8]. The second is a Nearest Centroid Classifier [9]. The third is a single layer Perceptron model [10].
We evaluated the models on three datasets that were chosen as examples of problems that would benefit from collaborative scenarios where many contributors can improve a model in order to create a shared public resource. The scenarios were: predicting a sport with user activity from Endomondo [11], sentiment analysis on movie reviews from IMDB [12], and determining if a news article is fake [13]. In all of these scenarios users benefit from having a direct impact on improving a model they frequently use and not relying on a centralized authority to host and control the model. Transaction costs (or gas) for each operation were also compared since these costs can be significant for the public Ethereum blockchain.
The Self-Assessment IM allows ongoing verification of data contributions without the need for a centralized party to evaluate data contributions. Here are the highlights of the IM as explained in [3]:
- Deploy: One model, h, already trained with some data, is deployed.
- Deposit: Each data contribution with data x and label y also requires a deposit, d. Data and meta-data for each contribution are stored in a smart contract.
- Refund: To claim a refund on their deposit, after a time t has passed and if the current model, h, still agrees with the originally submitted classification, i.e. if \(h(x) == y\), then the contributor can have their entire deposit d returned.
  - We now assume that (x, y) is "verified" data.
  - The successful return of the deposit should be recorded in a tally of points for the wallet address.
- Take: A contributor that has already had data verified in the Refund stage can locate a data point (x, y) for which \(h(x) \ne y\) and request to take a portion of the deposit, d, originally given when (x, y) was submitted.
If the sample submitted, (x, y), is incorrect, then within time t, other contributors should submit \((x, y')\) where \(y'\) is the correct, or at least generally preferred, label for x and \(y' \ne y\). This is similar to how one generally expects bad edits to popular Wikipedia [14] articles to be corrected in a timely manner.
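To make these rules concrete, the following is a minimal Python sketch of the Self-Assessment lifecycle; the class and method names (`SelfAssessmentIM`, `add_data`, `refund`, `report`) and the exact bookkeeping are illustrative assumptions and do not correspond to the actual Solidity contract interface in [3].

```python
import time
from dataclasses import dataclass

@dataclass
class Contribution:
    contributor: str
    x: tuple              # feature values for the sample
    y: int                # label originally submitted
    deposit: float        # d: deposit locked with the contribution
    submitted_at: float
    claimed: float = 0.0  # portion of the deposit already paid out

class SelfAssessmentIM:
    """Minimal sketch of the Self-Assessment incentive mechanism described in [3]."""

    def __init__(self, model, refund_wait_s):
        self.model = model                  # h: the shared model
        self.refund_wait_s = refund_wait_s  # t: time before a refund or report is allowed
        self.contributions = []             # data and meta-data for each contribution
        self.verified_count = {}            # n(c): tally of "verified" contributions per address

    def add_data(self, contributor, x, y, deposit):
        self.model.update(x, y)             # the shared model is updated with the contribution
        self.contributions.append(Contribution(contributor, x, y, deposit, time.time()))

    def refund(self, c: Contribution):
        assert time.time() - c.submitted_at >= self.refund_wait_s
        if self.model.predict(c.x) == c.y:  # h(x) == y: the data is now "verified"
            self.verified_count[c.contributor] = self.verified_count.get(c.contributor, 0) + 1
            c.claimed = c.deposit
            return c.deposit                # the entire deposit d is returned
        return 0.0

    def report(self, reporter, c: Contribution):
        assert time.time() - c.submitted_at >= self.refund_wait_s
        assert self.verified_count.get(reporter, 0) > 0   # must have verified data first
        if self.model.predict(c.x) != c.y:  # h(x) != y: the contribution looks "bad"
            # Portion proportional to the reporter's share of verified contributions (see Sect. 4).
            share = c.deposit * self.verified_count[reporter] / sum(self.verified_count.values())
            take = min(share, c.deposit - c.claimed)
            c.claimed += take
            return take
        return 0.0
```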
As proposed, the Self-Assessment IM could result in problems such as participants losing deposits and the model becoming inaccurate if the proper parameters are not set when the framework is initially deployed. In this work, we analyze the choice of several possible supervised models and configurations with the Self-Assessment IM in order to find best practices.
2 Machine Learning Models
In this section, we outline several choices of machine learning model for use with Decentralized and Collaborative AI on Blockchain as proposed in [3]. The choice of model architecture relates closely to the choice of incentive mechanism. In this work, we analyze models for the Self-Assessment incentive mechanism because it suits the decentralized nature of public blockchains: a centralized organization should not need to maintain the IM, for example, by funding it [3].
For our experiments, we mainly consider supervised classifiers because they can be used for many applications and can be easily evaluated using test sets. In order to keep transaction costs low, we propose to leverage work in the Incremental Learning space [15] by using models that can be efficiently updated with a single sample. Transaction costs, or "gas" as they are called in Ethereum [2], are important for most public blockchains as a way to pay for the computation cost of executing a smart contract.
2.1 Naive Bayes
The first model is a Naive Bayes classifier, chosen for its applicability to many types of problems [8]. The Naive Bayes classifier assumes each feature in the model is independent; this assumption is what helps make computation fast when updating and predicting. To update the model, we just need to update several counts, such as the number of data points seen, the number of times each feature was seen, the number of times each feature was seen for each class, etc. When predicting, all of these counts are used for the features present in the sparse sample to compute the most likely class for the sample using Bayes' Rule [8].
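As an illustration of this count-based update and prediction, here is a minimal Python sketch of a Naive Bayes classifier over sparse samples; the class name, smoothing, and data layout are assumptions for illustration and differ from the Solidity implementation in the repository.

```python
from collections import defaultdict
import math

class SparseNaiveBayes:
    """Count-based Naive Bayes over sparse samples {feature_index: count} (illustrative sketch)."""

    def __init__(self, num_classes, smoothing=1.0):
        self.num_classes = num_classes
        self.smoothing = smoothing
        self.class_counts = [0] * num_classes        # number of data points seen per class
        self.feature_counts = [defaultdict(int) for _ in range(num_classes)]  # per-class feature counts
        self.total_feature_counts = [0] * num_classes
        self.vocab = set()

    def update(self, x, y):
        # Only a handful of counts change for one sample, so updates are cheap.
        self.class_counts[y] += 1
        for f, count in x.items():
            self.feature_counts[y][f] += count
            self.total_feature_counts[y] += count
            self.vocab.add(f)

    def predict(self, x):
        total = sum(self.class_counts)
        best, best_score = 0, -math.inf
        for c in range(self.num_classes):
            # Log prior plus per-feature log likelihoods (Bayes' Rule with the independence assumption).
            score = math.log((self.class_counts[c] + self.smoothing) / (total + self.smoothing * self.num_classes))
            denom = self.total_feature_counts[c] + self.smoothing * max(len(self.vocab), 1)
            for f, count in x.items():
                score += count * math.log((self.feature_counts[c][f] + self.smoothing) / denom)
            if score > best_score:
                best, best_score = c, score
        return best
```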
2.2 Nearest Centroid
A Nearest Centroid Classifier computes the average point (or centroid) of all points in a class and classifies new points by the label of the centroid they are closest to [9]. It can also be easily adapted to support more than two classes (which we do not do in this work). For this model, we keep track of the centroid for each class and update it using the cumulative moving average method [16]; therefore we also need to record the number of samples that have been given for each class. Updating the model with one sample updates the centroid for the given class but not for the other classes. This model can be used with dense data representations.
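The cumulative moving average update can be sketched in a few lines of Python; the class below is illustrative only and omits the sparse-prediction optimization discussed later in the experiments.

```python
class NearestCentroid:
    """Nearest Centroid Classifier with cumulative-moving-average updates (illustrative sketch)."""

    def __init__(self, num_classes, num_features):
        # In practice the centroids would be initialized from an already trained model.
        self.centroids = [[0.0] * num_features for _ in range(num_classes)]
        self.counts = [0] * num_classes   # number of samples seen per class

    def update(self, x, y):
        # Only the centroid of class y changes; the other centroids are untouched.
        n = self.counts[y]
        self.centroids[y] = [(c * n + xi) / (n + 1) for c, xi in zip(self.centroids[y], x)]
        self.counts[y] = n + 1

    def predict(self, x):
        # Return the class whose centroid is closest by squared Euclidean distance.
        def sq_dist(c):
            return sum((ci - xi) ** 2 for ci, xi in zip(c, x))
        return min(range(len(self.centroids)), key=lambda k: sq_dist(self.centroids[k]))
```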
2.3 Perceptron
A single layer Perceptron is a useful linear model for binary classification [10]. We evaluate this model because it can be used for sparse data like text as well as dense data. The Perceptron's update algorithm only changes the weights if the model currently classifies the sample incorrectly. This is good for our system since it should help avoid overfitting. The model can be efficiently updated by adding or subtracting, depending on the sample's label, the values of the sample's features to or from the model's weights.
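A minimal Python sketch of this update rule follows; the learning rate and the sparse sample format are assumptions made for illustration.

```python
class Perceptron:
    """Single layer Perceptron for binary classification with 0/1 labels (illustrative sketch).
    Weights are only touched when the current prediction disagrees with the given label."""

    def __init__(self, num_features, learning_rate=1.0):
        self.weights = [0.0] * num_features
        self.bias = 0.0
        self.learning_rate = learning_rate

    def predict(self, x):
        # x is a sparse sample: {feature_index: value}
        score = self.bias + sum(self.weights[f] * v for f, v in x.items())
        return 1 if score >= 0 else 0

    def update(self, x, y):
        if self.predict(x) == y:
            return  # no update when the model already agrees; this helps avoid overfitting
        direction = 1.0 if y == 1 else -1.0
        for f, v in x.items():
            self.weights[f] += direction * self.learning_rate * v
        self.bias += direction * self.learning_rate
```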
3 Datasets
The three datasets were chosen as examples of problems that would benefit from collaborative scenarios where many contributors can improve a model in order to create a shared public resource. In each scenario, the users of an application that would use such a model benefit from having a direct impact on improving the model they frequently use and not relying on a centralized authority to host and control the model.
3.1 Fake News Detection
Given the text for a news article, the task is to determine if the story is reliable or not [13]. We convert each text to a sparse representation using the term frequency of bigrams, considering only the top 1000 bigrams by frequency count in the training set. While solving fake news detection is likely too difficult for simple models, a detector would greatly benefit from decentralization: freedom from being biased by a centralized authority.
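This featurization can be sketched as follows in Python; the tokenization and the tie-breaking when selecting the top 1000 bigrams are assumptions and may differ from the code in our repository.

```python
from collections import Counter

def build_bigram_vocab(train_texts, vocab_size=1000):
    """Keep only the top `vocab_size` bigrams by frequency count in the training set."""
    counts = Counter()
    for text in train_texts:
        tokens = text.lower().split()
        counts.update(zip(tokens, tokens[1:]))
    return {bigram: i for i, (bigram, _) in enumerate(counts.most_common(vocab_size))}

def featurize(text, vocab):
    """Sparse term-frequency representation of one article: {feature_index: count}."""
    tokens = text.lower().split()
    return dict(Counter(vocab[b] for b in zip(tokens, tokens[1:]) if b in vocab))
```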
3.2 Activity Prediction
The FitRec datasets contain information recorded from participants' fitness trackers during certain activities [11]. In order to predict whether someone was biking or running, we used the following features: heart rate, maximum speed, minimum speed, average speed, median speed, and gender. We did some simple feature engineering with those features, such as using average heart rate divided by minimum heart rate. As usual, all of our code is public.
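A hypothetical sketch of this feature extraction is shown below; the record field names (`speeds`, `heart_rates`, `gender`) are invented for illustration and do not match the FitRec data format exactly.

```python
def activity_features(record):
    """Build the dense feature vector described above from one workout record (sketch)."""
    speeds = record["speeds"]            # list of speed readings for the workout
    heart_rates = record["heart_rates"]  # list of heart-rate readings
    avg_hr = sum(heart_rates) / len(heart_rates)
    sorted_speeds = sorted(speeds)
    return [
        avg_hr,
        max(speeds),
        min(speeds),
        sum(speeds) / len(speeds),
        sorted_speeds[len(sorted_speeds) // 2],   # median speed
        1.0 if record["gender"] == "male" else 0.0,
        avg_hr / min(heart_rates),                # engineered ratio mentioned above
    ]
```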
Fitness trackers and the start-ups developing them have gained in popularity in recent years. A user considering purchasing a new tracker might not trust that the manufacturer developing it will still be able to host a centralized model in a few years: the company could go bankrupt or simply discontinue the service. Using a decentralized model gives users confidence that the model they need will be available for a long time, even if the company is not. This should even give them the assurance to buy the first version of a product, knowing that it should improve without them being forced into buying a later version. Even if the model does get corrupted, applications can easily revert to an earlier version on the blockchain, still giving users the service they need [3].
3.3 IMDB Movie Review Sentiment Analysis
The dataset of 25,000 IMDB movie reviews is a dataset for sentiment analysis where the task is to predict if the English text of a movie review is positive or negative [12]. We used word-based features limited to only the 1000 most common words in the dataset. This particular sentiment analysis dataset was chosen for this work because of its size and popularity. Even though this dataset focuses on movie reviews, a collaboratively built model for sentiment analysis can in general be applicable in many scenarios, such as a system to monitor posts on social media. Users could train a shared model when they flag posts or messages as abusive, and this model could be used by several social media services to provide a more pleasant experience to their users.
4 Experiments
We conducted experiments for the three datasets with each of the three models. The experiments ran in simulations in order to quickly determine the outcome of different configurations. The code for our simulations is all public. Each simulation iterates over the samples in the dataset, submitting each sample once. For simplicity, we assumed that each scenario has just two agents representing the two main types of user groups: "good" and "bad". We refer to these as agents since they may not be real users but could be programs, possibly even generating data to submit. The "good" agent almost always submits correct data with the label as provided in the dataset, as a user would normally submit correct data in a real-world use case. The "bad" agent represents those that wish to decrease the model's performance, so the "bad" agent always submits data with the opposite of the label provided in the dataset. Since the "bad" agent is trying to corrupt the model, they are willing to deposit more (when required) to update the model. This allows them to update the model more quickly after the model has already been updated. The "good" agent only updates the model if the deposit required to do so is low, otherwise they wait until later. They also check the model's recent accuracy on the test set before submitting data. In the real world, it is important for people to monitor the model's performance and determine whether it is worth trying to improve it or if it is totally corrupt. If the model's accuracy is around 85%, then it can be assumed to be okay and not overfitting, so ideally it should be safe to submit new data. If incorrect data were always submitted, or submitted too often by "bad" agents, then the model's accuracy should decrease and honest users would most likely lose their deposits because their data would not satisfy the refund criteria of the IM. We use loose terms here like "should" and "likely" because it is difficult to be general in terms of all types of models. For example, a rule-based model could certainly be used that memorizes training data. As long as no duplicate data is submitted with different labels, a rule-based model would allow each participant to get their deposit back and the analysis would be trivial. The characteristics of the agents are compared in Table 1.
Each agent must wait 1 day before claiming a refund for "good" data or reporting data as "bad". This was referred to as t in our original paper. When reporting data as "bad", an agent can take an amount from the initial deposit proportional to the share of "verified" contributions they have. This can be written as \(r(c_r, d) = d \times \frac{n(c_r)}{\sum _{\text {all } c} n(c)}\) using the notation in our initial paper. After 9 days, either agent can claim the entire remaining deposit for a specific data sample. This was \(t_a\) in our original paper.
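For illustration, the share taken when reporting can be computed as in the following small Python snippet (the helper name and the example numbers are ours, not from the contracts):

```python
def take_share(deposit, reporter_verified, verified_per_contributor):
    """r(c_r, d) = d * n(c_r) / sum_c n(c): the portion of a reported deposit that a
    reporter with `reporter_verified` verified contributions may take."""
    return deposit * reporter_verified / sum(verified_per_contributor.values())

# Example: a deposit of 10 wei; the reporter has 30 verified contributions out of
# 40 across all contributors, so they may take 10 * 30/40 = 7.5 wei.
print(take_share(10, 30, {"reporter": 30, "other": 10}))  # 7.5
```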
For each dataset, we compared:
- The change of each agent's balance over time. While using the IM, an agent may lose deposits to the other agent, reclaim their deposit, or profit by taking deposits that were from the other agent. We monitor balances in order to determine if it can be beneficial for an agent to participate by submitting data, whether it be correct or incorrect.
- The change of the model's accuracy with respect to a fixed test set over time. In a real-world scenario, it would be important for users to monitor the accuracy as a proxy to measure if they should continue to submit data to the model. If the accuracy declines, then it could mean that "bad" agents have corrupted the model.
- The "ideal" baseline of the model's accuracy on the test set if the model were to be trained on all of the simulation data. In the real world, this would of course not be available because the data would not be known yet.
We also compared Ethereum gas costs (i.e. transaction costs) for the common actions in the framework. The Update gas cost shown for each model is for the case where the model did not agree with the provided label and so needed its weights to be updated. Otherwise, the Perceptron Update method would cost only slightly more than prediction because a Perceptron model does not get updated if it already predicts the same classification as the label given for a data sample. The gas cost of predicting is not shown because prediction can be done "off-chain" (without creating a transaction), which incurs no gas cost since it does not involve writing data to the blockchain. However, predicting is the most expensive operation inside of Refund and Report, so the cost of doing prediction "on-chain" can be estimated using those operations. Contracts were compiled with the "solc-js" compiler using Solidity 0.6.2.
4.1 Fake News Detection
With each model, the "good" agent was able to profit and the "bad" agent lost funds. As can be seen in Fig. 1, the difference in balances was most significant with the Perceptron model. The Perceptron model had the highest accuracy, yet the Naive Bayes model was able to surpass its own baseline accuracy.
The Perceptron model had the lowest gas cost, as shown in Table 2. The deployment cost for the Naive Bayes model was much higher because each of the 1000 features effectively needs to be set twice (once for each class). The Update method for the Nearest Centroid Classifier is expensive because it needs to go through most dimensions of the 1000-dimensional centroids. Prediction (which happens in Refund and Report) did not need to go through each dimension because the distance to each centroid can be calculated by storing the magnitude of each centroid and then using the sparse input data to adjust that magnitude for just the few features present in the sparse input.
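This shortcut can be illustrated with a small Python function; it is an illustration of the idea only, not the Solidity code used in the contracts.

```python
def sparse_sq_dist(x, centroid, centroid_sq_magnitude):
    """Squared Euclidean distance from a sparse sample x ({feature_index: value}) to a
    centroid, touching only the few non-zero features of x. `centroid_sq_magnitude`
    is the pre-computed sum of squares of the centroid's components."""
    dist = centroid_sq_magnitude
    for i, xi in x.items():
        ci = centroid[i]
        # Replace the stored ci^2 term with the actual (xi - ci)^2 term for this feature.
        dist += (xi - ci) ** 2 - ci ** 2
    return dist
```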
4.2 Activity Prediction
As seen in Fig. 2, with each model the "good" agent can profit while the "bad" agent wastes a lot of funds. The Naive Bayes (NB) and Nearest Centroid Classifier (NCC) models performed very well on this type of data, hardly straying from the ideal baseline. The linear Perceptron, on the other hand, was much more sensitive to data from the "bad" agent and its accuracy dropped significantly several times before finally recovering.
The Perceptron model had the lowest gas cost, as shown in Table 3. The gas costs were fairly close for each action amongst the models, especially compared to the other datasets. This is mostly because there are very few features (just 9) for this dataset.
4.3 IMDB Movie Review Sentiment Analysis
Figure 3 shows that all "good" agents can profit while the "bad" agent loses most or all of the initial balance. All models maintained their accuracy with this type of data, with the Naive Bayes model performing the best.
By only a small amount, the Naive Bayes model beat the Perceptron model for the lowest Update gas cost. The gas costs for all actions are shown in Table 4. As with the Fake News dataset, the Update cost for the Nearest Centroid Classifier was high because most dimensions need to be visited. The Naive Bayes model had a much higher deployment cost since the amount of data stored is effectively doubled, as each feature needs to be set for each of the two classes.
5 Conclusion
In all experiments, the Perceptron model was consistently the cheapest to use. This was mostly because the size of the model was much smaller than that of the other two models, which need to store information for each class: effectively twice the amount of information that the Perceptron needs to store. While each model was expensive to deploy, this is a one-time cost. This cost is far less than the comparable cost of hosting a web service with the model for several months.
Most models were able to maintain their accuracy, except for the volatile Perceptron on the Activity Prediction dataset. Even if a model gets corrupted with incorrect data, it can be forked from an earlier time when its accuracy on a hidden test set was higher. It can also be retrained with data identified as "good" while it was deployed. It is important for users to be aware of the accuracy of the model on some hidden test set. Users can maintain their own hidden test sets or possibly use a service supplied by an organization that publishes a rating for a model based on the test sets it has.
The balance plots looked mostly similar across the experiments because the "good" agent was already careful and because we set a constant wait time of 9 days for either agent to claim the remaining deposit for a data contribution. The "good" agent honestly submitted correct data and only did so when they thought the model was reliable; this helped ensure that they could recover their deposits and earn rewards for reporting many contributions from the "bad" agent. When the "bad" agent is able to corrupt the model, it can successfully report a portion of the contributions from the "good" agent as bad because the model would no longer agree with those contributions. However, the "bad" agent cannot claim a majority of these deposits when reporting a contribution since they do not have as many "verified" contributions as the "good" agent. This leaves a leftover amount for which either agent must wait 9 days before taking the entire remaining deposit, hence the periodic-looking patterns in the balance plots every 9 days. The pattern continues throughout the simulation because there is always data for which the deposit cannot be claimed by either agent after just the initial refund wait time of 1 day.
Future work analyzing more scenarios is encouraged and easy to implement with our open source tools at https://github.com/microsoft/0xDeCA10B/tree/master/simulation. For example, one could change the initial balances of each agent to determine how much a "good" agent needs to spend to stop a much more resourceful "bad" agent willing to corrupt a model.
References
Nakamoto, S., et al.: Bitcoin: a peer-to-peer electronic cash system (2008)
Buterin, V.: A next generation smart contract & decentralized application platform (2015)
Harris, J.D., Waggoner, B.: Decentralized and collaborative AI on blockchain. In: 2019 IEEE International Conference on Blockchain (Blockchain), July 2019
Marathe, A., Narayanan, K., Gupta, A., Pr, M.: DInEMMo: decentralized incentivization for enterprise marketplace models. In: 2018 IEEE 25th International Conference on High Performance Computing Workshops (HiPCW), pp. 95–100 (2018)
Kurtulmus, A.B., Daniel, K.: Trustless machine learning contracts; evaluating and exchanging machine learning models on the ethereum blockchain (2018)
Lihu, A., Du, J., Barjaktarevic, I., Gerzanics, P., Harvilla, M.: A proof of useful work for artificial intelligence on the blockchain (2020)
Li, M., et al.: CrowdBC: a blockchain-based decentralized framework for crowdsourcing. IEEE Trans. Parallel Distrib. Syst. 30(6), 1251–1266 (2019)
Webb, G.I.: Naïve Bayes. In: Sammut, C., Webb, G.I. (eds.) Encyclopedia of Machine Learning, pp. 713–714. Springer, Boston (2010). https://doi.org/10.1007/978-0-387-30164-8_576
Tibshirani, R., Hastie, T., Narasimhan, B., Chu, G.: Diagnosis of multiple cancer types by shrunken centroids of gene expression. Proc. Nat. Acad. Sci. 99(10), 6567–6572 (2002)
Rosenblatt, F.: The perceptron: a probabilistic model for information storage and organization in the brain. Psychol. Rev. 65(6), 386 (1958)
Ni, J., Muhlstein, L., McAuley, J.: Modeling heart rate and activity data for personalized fitness recommendation. In: The World Wide Web Conference (WWW 2019), New York, NY, USA, pp. 1343–1353. Association for Computing Machinery (2019)
Maas, A.L., Daly, R.E., Pham, P.T., Huang, D., Ng, A.Y., Potts, C.: Learning word vectors for sentiment analysis. In: Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, Portland, Oregon, USA, pp. 142–150. Association for Computational Linguistics, June 2011
Kaggle: UTK Machine Learning Club: Fake News (2020). https://www.kaggle.com/c/fake-news/overview. Accessed 07 Jan 2020
Wikipedia contributors: Wikipedia – Wikipedia, the free encyclopedia (2020). https://en.wikipedia.org/w/index.php?title=Wikipedia. Accessed 08 Jan 2020
Schlimmer, J.C., Fisher, D.: A case study of incremental concept induction. In: Proceedings of the Fifth AAAI National Conference on Artificial Intelligence (AAAI 1986), pp. 496–501. AAAI Press (1986)
Wikipedia contributors: Moving average – Wikipedia, the free encyclopedia (2020). https://en.wikipedia.org/w/index.php?title=Moving_average. Accessed 3 Feb 2020