1 Introduction

Live e-commerce, and rural live e-commerce in particular, helps agricultural products of all kinds sell farther and at better prices and raises farmers’ incomes, and it will play a greater role as China’s rural revitalization advances. Live webcasting platforms are expanding quickly in the age of new media, and live-stream selling has grown explosively [1, 2]. To improve students’ entrepreneurial skills in rural live e-commerce practice, teaching should be driven by the demands of businesses and by innovative instructional methods [3]. Using project-based practical teaching in place of the conventional cramming (“duck-stuffing”) teaching method, enterprise instructors help students carry out rural live e-commerce projects, strengthen students’ training in innovation and entrepreneurship, and improve new farmers’ live e-commerce entrepreneurial skills [4, 5].

Rural live e-commerce has a catalytic influence on the transformation of regional rural industries, and developing young farmers’ entrepreneurial e-commerce abilities can aid this process [6]. The development of new farmers’ skills goes beyond individual talent development or the modernization of conventional ideas and emphasizes the value of teamwork [7]. Besides anchoring, functions such as customer service, operations, and logistics are all crucial to the growth of rural live e-commerce [8]. To modernize and upgrade Jinhua’s rural industries, local rural enterprises that use live e-commerce must create a comprehensive ecological chain from people to goods to industry, while also growing the local economy [9,10,11].

In essence, e-commerce is a modern form of business management. Its development can bring new momentum to the innovation and reform of the production mode and development structure of traditional industries and help raise the level of social and economic development [12]. The rural revitalization strategy, in turn, aims to change the original production and lifestyle of the countryside with the policy support of the national government, lift people out of poverty, and promote local social and economic development, so it has much in common with the development goals of e-commerce. Therefore, the practice and development of rural e-commerce have great practical significance in the context of rural development [13]. In particular, the supply-side structural reform of rural industries should be vigorously promoted to speed up the industrialization of agriculture. Traditional agricultural supply is located mainly in rural areas, while the demand for its products comes mainly from the surrounding towns [14, 15]. Because production and sales take place in different areas, information on the supply and demand of agricultural products is not transmitted effectively. In addition, the scale of rural agricultural development is generally small, and farmers have little enthusiasm for obtaining the latest market information on supply and demand, so mismatches between supply and demand often occur in the rural agricultural market [15,16,17,18]. With the development of e-commerce in rural areas, the data provided by e-commerce platforms can be used to analyze the market for agricultural products accurately and to understand their supply and demand. This lays the foundation for timely adjustment of the production and development structure in rural areas, promotes the supply-side structural reform of rural industries, and supports the development of local agricultural industrialization [19].

Collaborative recommendation refers to recommendation based on the correlation between the target customer and other customers [20,21,22,23,24]. When the system finds that a customer or a group of customers has consumption preferences similar to those of the target customer, it can predict the consumption behavior of the target customer from the consumption behavior of these customers [25]. However, an e-commerce website has a very large number of customers and products, and in this case it is very difficult to accurately find a group of customers whose consumption preferences are similar to those of the target customer [26,27,28].

The contributions of this paper are as follows:

This paper analyzes and classifies the preferences of product buyers through collaborative filtering methods and proposes a ROCK-based clustering algorithm to address the data scarcity problem faced by collaborative recommendation.

The designed recommendation system predicts the target customer’s evaluation of each product in the candidate product set and recommends the top N products with the highest predicted evaluation to the target customer.

The case study shows that the solution in this paper can correctly predict the probability of rural economic growth by 5%.

2 Data Scarcity

The goal of collaborative filtering is to predict how much a person will like an item he or she has not yet rated. The prediction is based on historical ratings of a number of items by a group of earlier users. User ratings of items can be explicit or implicit. Explicit ratings are usually numerical scores that consumers give a product: a high score implies that the user likes the product very much, and a low score implies the opposite.

The values in Table 1 are customers’ ratings of items; the higher the value, the more the customer likes the item. Table 1 shows that customer_E and customer_A have the same consumption preferences; therefore, we can infer that customer_E will also like commodity_5.

Table 1 Customer evaluation information on goods

In general, two customers rarely have identical consumption preferences, so the similarity of their preferences is measured by the correlation coefficient in Eq. (1). In the formula, \(R_{x,d}\) denotes customer x’s evaluation of product d, and \(\overline{R}_{x}\) is the average of customer x’s evaluations over the products:

$$r(x,y) = \frac{{\sum\limits_{{d \in {\text{Total products}}}} {\left( {R_{x,d} - \overline{R}_{x} } \right)\left( {R_{y,d} - \overline{R}_{y} } \right)} }}{{\sqrt {\sum\limits_{{d \in {\text{Total products}}}} {\left( {R_{x,d} - \overline{R}_{x} } \right)^{2} } \sum\limits_{{d \in {\text{Total products}}}} {\left( {R_{y,d} - \overline{R}_{y} } \right)^{2} } } }}$$
(1)

Assume that the target customer y has m neighbor customers. Using the preference similarity r(x,y) from Eq. (1), we can predict the target customer y’s rating of product i according to Eq. (2), where the sum runs over the m neighbor customers x:

$$R_{y,i} = \frac{{\sum\limits_{{x \in {\text{Neighbor customers}}}} {r(x,y) \times R_{x,i} } }}{m}$$
(2)
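Eqs. (1) and (2) can be illustrated with a minimal Python sketch. It is not the authors’ implementation: the function names, the dictionary layout of the ratings, and the handling of missing ratings (sums restricted to products rated by both customers, unrated products contributing 0 to the prediction) are assumptions made for illustration only.

```python
from math import sqrt


def pearson_similarity(ratings_x, ratings_y):
    """Eq. (1): preference similarity r(x, y) of two customers.

    ratings_x and ratings_y map product ids to numeric ratings.
    Assumption: the sums run over products rated by both customers,
    and each mean is taken over that customer's own ratings.
    """
    common = set(ratings_x) & set(ratings_y)
    if not common:
        return 0.0
    mean_x = sum(ratings_x.values()) / len(ratings_x)
    mean_y = sum(ratings_y.values()) / len(ratings_y)
    numerator = sum(
        (ratings_x[d] - mean_x) * (ratings_y[d] - mean_y) for d in common
    )
    denominator = sqrt(
        sum((ratings_x[d] - mean_x) ** 2 for d in common)
        * sum((ratings_y[d] - mean_y) ** 2 for d in common)
    )
    return numerator / denominator if denominator else 0.0


def predict_rating(neighbors, product):
    """Eq. (2): similarity-weighted sum of neighbor ratings divided by m.

    neighbors is a list of (similarity, ratings) pairs, one per neighbor.
    Assumption: a neighbor who has not rated the product contributes 0.
    """
    m = len(neighbors)
    if m == 0:
        return 0.0
    return sum(r * ratings.get(product, 0.0) for r, ratings in neighbors) / m
```

For example, `predict_rating([(1.0, {"p": 4}), (0.5, {"p": 2})], "p")` averages the weighted ratings 4.0 and 1.0 over the two neighbors.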

3 Clustering Algorithm Based on Category Attributes

Customers in the same cluster have the same purchase pattern. For example, the customers in one cluster may be mainly married women with children, whose purchases are mainly toys, children’s food, etc., while the customers in another cluster may be mainly high-income people, whose purchases are mainly expensive imported goods [29].

Clustering brings three benefits. First, customers belonging to the same cluster tend to share many of the goods they are interested in, so evaluation information that meets the requirements can be found efficiently within the cluster. Second, since the customers in a cluster have the same consumption behavior pattern, the consumption behavior of a target customer is predicted more accurately from the information of the customers in his or her own cluster. Third, the clustering process can be performed offline, so it does not affect the response speed of the recommendation system.

3.1 Defects of Traditional Clustering Algorithms

Traditional clustering methods construct a k-partition of the data that makes an optimization function close to optimal. The usual criterion is to minimize the sum of the Euclidean distances from each point to the centroid of its cluster. That is,

$$\min E = \sum\limits_{i = 1}^{k} {\sum\limits_{{x \in C_{i} }} {d\left( {x,m_{i} } \right)} }$$
(3)

In Eq. (3), \(m_{i}\) is the centroid of cluster \(C_{i}\) and \(d\left( {x,m_{i} } \right)\) is the Euclidean distance from tuple x to \(m_{i}\). This approach works well for numeric data, but it does not apply to data with category attributes.

Consider the shopping record database of a shopping mall, in which a transaction record \(T_{i}\) records the items included in the transaction. For convenience, \(T_{i}\) is denoted as the set {A, B,…}, whose elements A, B,… are the items included in the transaction. Usually, we do not consider the quantity of an item purchased, but only indicate qualitatively whether an item is included in a transaction, so the transaction records can also be transformed into the form of Table 2. In Table 2, TID denotes the transaction code and UID denotes the customer code. An item takes the value 1 if the transaction contains that item and 0 if it does not.

Table 2 Transaction record table

The traditional clustering method is not applicable to category attribute data. For example, consider four transactions over six items: \(T_{1} = \{ 1,2,3,5\}\), \(T_{2} = \{ 2,3,4,5\}\), \(T_{3} = \{ 1,4\}\), and \(T_{4} = \{ 6\}\). Using 0/1 values to indicate whether each item is purchased, these four transactions can be represented as the vectors \(T_{1} = (1,1,1,0,1,0)\), \(T_{2} = (0,1,1,1,1,0)\), \(T_{3} = (1,0,0,1,0,0)\), and \(T_{4} = (0,0,0,0,0,1)\). Using Euclidean distance to measure the proximity between points, the distance between \(T_{1}\) and \(T_{2}\), \(\sqrt 2\), is the shortest, so the traditional algorithm merges them first. The centroid of the new cluster, which we call \(T_{12}\), is \(T_{12} = (0.5,1,1,0.5,1,0)\). In the next step, the distances from \(T_{12}\) to \(T_{3}\) and to \(T_{4}\) are \(\sqrt {3.5}\) and \(\sqrt {4.5}\), whereas the distance between \(T_{3}\) and \(T_{4}\) is only \(\sqrt 3\); the algorithm therefore merges \(T_{3}\) and \(T_{4}\), even though the two transactions have no item in common.
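The distances in this example can be checked with a short Python sketch. The variable names are illustrative, and the code only reproduces the arithmetic of the four-transaction example; it is not a clustering implementation.

```python
from itertools import combinations
from math import dist

ITEMS = range(1, 7)  # the six items that appear in the example
transactions = {
    "T1": {1, 2, 3, 5},
    "T2": {2, 3, 4, 5},
    "T3": {1, 4},
    "T4": {6},
}


def to_vector(items):
    """Encode a transaction as a 0/1 vector over ITEMS."""
    return [1.0 if i in items else 0.0 for i in ITEMS]


vectors = {name: to_vector(items) for name, items in transactions.items()}

# Euclidean distance between every pair of transactions.
pairwise = {
    (a, b): dist(vectors[a], vectors[b]) for a, b in combinations(vectors, 2)
}
closest = min(pairwise, key=pairwise.get)  # the pair merged first

# Centroid of the merged cluster {T1, T2}.
centroid_12 = [(p + q) / 2 for p, q in zip(vectors["T1"], vectors["T2"])]

# After the merge, T3 and T4 share no item, yet they are closer to each
# other than either is to the centroid of {T1, T2}.
t3_to_t4 = pairwise[("T3", "T4")]
t3_to_centroid = dist(centroid_12, vectors["T3"])
t4_to_centroid = dist(centroid_12, vectors["T4"])
```

Running the sketch shows that a distance-based criterion would group two transactions with nothing in common, which is the weakness that motivates a different criterion for category attribute data.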

3.2 ROCK Algorithm

ROCK (A Robust Clustering Algorithm for Categorical Attributes) was proposed in [30]. The ROCK algorithm uses links, rather than the commonly adopted distance, as the criterion for cluster partitioning. The logic of the algorithm is roughly as follows.

Given a threshold 0 < θ ≤ 1, two transactions \(T_{i} ,T_{j}\) are neighbors of each other if they satisfy the following condition:

$${\text{sim}}(T_{i} ,T_{j} ) > \theta$$
(4)

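Here, \({\text{sim}}(T_{i} ,T_{j} )\) denotes the similarity between \(T_{i}\) and \(T_{j}\). It is calculated as the ratio of the number of items the two transactions share to the number of items in their union:

$${\text{sim}}(T_{i} ,\,T_{j} ) = \frac{{\left| {T_{i} \cap T_{j} } \right|}}{{\left| {T_{i} \cup T_{j} } \right|}}$$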
(5)

A link is the number of common neighbors between two points. The larger \({\text{link}}(T_{i} ,T_{j} )\) is, the higher the probability that \(T_{i}\) and \(T_{j}\) belong to the same cluster.

Since the optimization criterion of the ROCK algorithm is to make the links between transactions in the same cluster as large as possible, any two transactions in the same cluster tend to have many items in common, i.e., their transaction patterns are similar. Thus, customers with similar consumption behavior patterns can be clustered into one group using the ROCK algorithm.
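The neighbor and link definitions in Eqs. (4) and (5) can be sketched in Python as follows. This covers only the similarity, neighbor, and link computations, not the full ROCK merging procedure of [30]. The function names and the sample transactions are hypothetical, and, as a simplifying assumption, a transaction is not counted as its own neighbor.

```python
from itertools import combinations


def jaccard(a, b):
    """Eq. (5): shared items divided by items in the union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def neighbor_sets(transactions, theta):
    """Map each transaction id to the ids whose similarity exceeds theta (Eq. 4)."""
    neighbors = {tid: set() for tid in transactions}
    for a, b in combinations(transactions, 2):
        if jaccard(transactions[a], transactions[b]) > theta:
            neighbors[a].add(b)
            neighbors[b].add(a)
    return neighbors


def link(neighbors, a, b):
    """Number of common neighbors of transactions a and b."""
    return len(neighbors[a] & neighbors[b])


# Hypothetical transactions: three overlapping baskets and one unrelated basket.
sample = {
    "T1": {1, 2, 3},
    "T2": {1, 2, 4},
    "T3": {1, 3, 4},
    "T4": {6, 7},
}
sample_neighbors = neighbor_sets(sample, theta=0.4)
```

With these sample baskets, `link(sample_neighbors, "T1", "T2")` counts the single common neighbor `T3`, while the unrelated basket `T4` has no neighbors and therefore no links.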

4 Recommendation System

Based on the above principles, we designed an e-commerce recommendation system based on cluster analysis, and the framework model of the system is shown in Fig. 1. The working principle of the model is as follows.

Fig. 1 Recommendation system model based on clustering analysis

4.1 Cluster Analysis

The purpose of cluster analysis is to gather customers with similar consumption behavior patterns into one cluster. The transaction database can be simplified to the example structure shown in Fig. 2a, in which one customer may appear in several transaction records. Therefore, it is necessary to combine the multiple transactions of a customer into a single record that reflects the customer’s consumption behavior. The process is illustrated in Fig. 2. After the merging, the transaction database shown in Fig. 2a is reduced to the database shown in Fig. 2b, and the reduced data set is submitted to the cluster analysis module, which classifies the customers into different clusters.

Fig. 2 Transaction data record consolidation
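The consolidation step can be sketched as follows, assuming each transaction is available as a (TID, UID, items) tuple. The tuple layout and the function name are assumptions made for illustration; the original database schema is not specified beyond the TID and UID columns.

```python
from collections import defaultdict


def consolidate(transactions):
    """Merge all transactions of a customer into one set of purchased items.

    transactions is an iterable of (tid, uid, items) tuples.
    Returns a dict mapping each customer id to the union of that
    customer's purchased items.
    """
    baskets = defaultdict(set)
    for _tid, uid, items in transactions:
        baskets[uid].update(items)
    return dict(baskets)
```

The resulting one-record-per-customer mapping is the kind of input that the cluster analysis module would then receive.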

4.2 Collaborative Filtering

In order to determine which products to recommend to the target customers, the collaborative filtering process needs to be completed in the following steps.

(A) Identify the target customer’s cluster

If the target customer is an existing customer, the customer’s cluster can be identified from the customer code. If the target customer is a new customer, the customer should first indicate online which products he is interested in, and the cluster is then determined from these answers: the target customer is assigned to the cluster whose product collection has the largest proportion of overlap with the set of products the target customer is interested in.
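A minimal sketch of the assignment of a new customer is given below. The text does not specify how the proportion of overlap is normalized, so the sketch assumes it is the share of the customer’s products of interest that the cluster’s product collection covers; the function name and data layout are likewise hypothetical.

```python
def assign_cluster(cluster_items, interested):
    """Pick the cluster whose product collection best covers the customer's interests.

    cluster_items maps a cluster id to the set of products bought in that cluster.
    interested is the set of products the new customer says he is interested in.
    Assumption: overlap is measured as covered interests / all interests.
    """
    if not cluster_items:
        return None

    def overlap(items):
        return len(items & interested) / len(interested) if interested else 0.0

    return max(cluster_items, key=lambda cid: overlap(cluster_items[cid]))
```

Other normalizations, such as the Jaccard coefficient of Eq. (5), would fit the same description and could give a different assignment.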

(B) Determine the set of candidate products to be recommended to the target customer

The consumption preferences of customers in the same cluster are similar. If a product is liked by customers in a cluster but the target customer has not yet purchased it, the target customer may also need it; he may not have purchased it because he is unaware of the need or simply does not know that the product exists on the website. Let \(TU_{1} , \ldots ,TU_{i} , \ldots ,TU_{n}\) represent the sets of goods purchased by each customer in a cluster, and let \(TU_{{\text{target}}}\) denote the set of goods that the target customer has purchased. Then the set S of goods that have been purchased by other customers in the cluster but not by the target customer is

$$S = \left( {TU_{1} \cup TU_{2} \cup \cdots \cup TU_{i} \cup \cdots \cup TU_{n} } \right) - TU_{{\text{target}}}$$
(6)

If a product in S has been purchased by only a few customers in the cluster, then, according to the basic principle of collaborative filtering, it cannot be recommended and should not be included in the set of recommended candidates. Let \(\alpha\) be an empirical threshold: only when the proportion of customers in the cluster who have purchased a product exceeds \(\alpha\) may the product belong to the set of candidate products recommended to the target customer, \(S_{{{\text{candidate}}}}\). Therefore, \(S_{{{\text{candidate}}}}\) is calculated by Eq. (7):

$$S_{{{\text{candidate}}}} = \left\{ {{\text{item}}_{i} \;\middle|\; \frac{{{\text{card}}({\text{item}}_{i} ,U)}}{{{\text{card}}(U)}} > \alpha ,\;{\text{item}}_{i} \in S} \right\}$$
(7)

Here, \({\text{card}}({\text{item}}_{i} ,U)\) is the number of customers in cluster U who have purchased item \({\text{item}}_{i}\), and \({\text{card}}(U)\) is the number of customers contained in cluster U.
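Eqs. (6) and (7) translate into a few lines of Python. The sketch below is illustrative only: the function name and data layout are hypothetical, and it assumes that the cluster mapping contains every customer of cluster U, including the target customer.

```python
def candidate_set(cluster_purchases, target_items, alpha):
    """Candidate products for the target customer (Eqs. 6 and 7).

    cluster_purchases maps each customer id in cluster U to the set of
    products that customer has purchased.
    target_items is the set of products the target customer has purchased.
    alpha is the empirical threshold on the purchase proportion.
    """
    # Eq. (6): everything bought in the cluster, minus the target's purchases.
    s = set().union(*cluster_purchases.values()) - target_items
    n_customers = len(cluster_purchases)  # card(U)
    # Eq. (7): keep items bought by more than a fraction alpha of the cluster.
    return {
        item
        for item in s
        if sum(item in bought for bought in cluster_purchases.values())
        / n_customers
        > alpha
    }
```

Because the comparison in Eq. (7) is strict, an item bought by exactly a fraction α of the cluster is excluded.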

(C) Use collaborative filtering to select, from the set of candidate products, the N items that the target customer is most likely to buy, and recommend them to the target customer

The fact that two customers belong to the same cluster indicates that most of the products they care about are the same, but it does not mean that their evaluations of the products are also similar. For example, one customer may buy a certain product and, after using it, rate it very low, while another customer may do the opposite [31]. Therefore, we also need to select from the cluster the customers whose consumption behavior patterns are similar to those of the target customer. Let β be the empirical threshold of similarity. We use Eq. (1) to calculate the preference similarity r(x,y) between each customer x in the cluster and the target customer y; if r(x,y) > β, then the information of customer x can be used to make predictions about the target customer, and we call such a customer a neighbor customer.

Using Eq. (2), we can predict the target customers’ evaluation of each product in the candidate product set separately, and take the top N products with the highest evaluation value and recommend them to the target customers.
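The neighbor selection and top-N ranking can be sketched as below. The sketch takes the similarities r(x,y) as already computed by Eq. (1). The function name, the data layout, the treatment of unrated products as 0, and the alphabetical tie-breaking are assumptions made for illustration, not details given in the text.

```python
def recommend_top_n(similarity, ratings, candidates, beta, n):
    """Rank candidate products by the predicted rating of Eq. (2).

    similarity maps each customer x in the cluster to r(x, y).
    ratings maps each customer x to a dict of product -> rating.
    candidates is the candidate product set from Eq. (7).
    beta is the empirical similarity threshold; n is the list length.
    """
    # Neighbor customers are those with r(x, y) > beta.
    neighbors = [x for x, r in similarity.items() if r > beta]
    if not neighbors:
        return []
    m = len(neighbors)
    predicted = {
        item: sum(similarity[x] * ratings[x].get(item, 0.0) for x in neighbors) / m
        for item in candidates
    }
    # Highest predicted rating first; ties broken by product name (assumption).
    ranked = sorted(predicted, key=lambda item: (-predicted[item], item))
    return ranked[:n]
```

Raising β shrinks the neighbor list and lowering it admits less similar customers, which is the trade-off examined in the experiments.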

5 Experimental Results

5.1 Experimental Setup

We used the data provided at https://www.cs.umn.edu/Research/Grouplens/data/million./ to verify the method of this paper. The experimental data contain a large number of products (3900 movies), and each customer has rated only a small percentage of them. Thus, it is difficult to determine the candidate set of recommended products and to identify the neighbor customers of the target customers.

We randomly selected 10 customers as the target customers and used the method proposed in this paper to make simulated recommendations, recommending 5 items to the target customer each time. Table 3 shows the average recommendation accuracy for different empirical thresholds β with 7 clusters and α = 80%. The experimental results show that the method proposed in this paper has high recommendation accuracy. As the table shows, if the empirical threshold β is too large, many customers are excluded from the neighbor customers and very little information is available for the collaborative filtering calculation, so the recommendation accuracy is reduced. Conversely, if β is too small, many customers whose consumption preferences differ from those of the target customer are included among the neighbor customers, and the recommendation accuracy is also reduced.

Table 3 Experimental results

This experiment uses Hadoop as the cloud computing platform, with 6 parallel computing nodes and 1 management node; all nodes are configured identically, with 4 GB of RAM and a 1 TB SATA hard disk. The environment is configured with Python 3.5 and Keras 1.2, and Eclipse is used for Java programming. The proposed algorithm and the baseline algorithm are each used to mine the experimental data and generate the corresponding rules.

5.2 Results

Fresh agricultural products are perishable and do not store well, so their storage and transportation are costly. In recent years, rural infrastructure has improved, with highways connecting villages and cable reaching households, but in some areas urban-rural integration is incomplete and rural logistics nodes are scattered, which increases distribution costs. In addition, live e-commerce places high demands on the network, and the limited penetration of broadband Internet service in rural areas leads to a poor live streaming experience, which limits the development of rural live e-commerce. Fig. 3 shows the clustering effect of different e-commerce businesses. This paper proposes a framework model of an e-commerce recommendation system based on clustering analysis, and the experiments show that the model can effectively address the data scarcity problem faced by collaborative filtering and that the recommendation results have a high accuracy.

Fig. 3 Clustering effect of different commodities

The live webcasting industry is growing in spurts, and more and more people are engaging in live e-commerce. At present, live content in the agricultural products industry is largely limited to the picking and processing of agricultural products; the content is monotonous and highly homogeneous, which easily causes aesthetic fatigue in the audience and makes it difficult to attract more viewers. At the same time, most farmers pay little attention to the classification and grading of agricultural products and lack awareness of quality control, which has seriously damaged the reputation of agricultural products sold online and makes it difficult to build local brands. As the live broadcast effect in Fig. 4 shows, rural live e-commerce is an effective way to innovate the sale of agricultural products, increase farmers’ income, and support rural revitalization. Building on traditional e-commerce, participants should be guided to recognize that rural live e-commerce is not only a form of transaction dominated by information, knowledge, and technology, but also involves live platform operation, live marketing, and supply chain integration throughout the whole transaction process.

Fig. 4 Live effect and economic income

“Live broadcast + e-commerce” increases agricultural sales. Compared with traditional e-commerce, webcasting can showcase goods and interact with consumers more comprehensively. The farmer-turned-anchor transforms the pictures, text, and video information of the online store into visualized content, conveys product information to consumers in real time, and promptly answers consumers’ questions during the live broadcast, which increases the probability that consumers place orders and boosts the sales of agricultural products. As shown in Fig. 5, “live streaming + e-commerce” also promotes the production process of agricultural products. Live webcasting is interactive in real time, which helps consumers understand all aspects of the products and improves the shopping experience. Through live webcasting, the whole process of planting, processing, and transporting agricultural products can be publicized, and story content such as the entrepreneur’s experience, the history of the agricultural products, and the edible value of the products can be added. Such high-quality live content enhances consumer stickiness and fan loyalty and wins consumers’ recognition of the products.

Fig. 5 Agricultural economic income and the impact of live streaming

To keep the comparison fair, the improved algorithm does not set target items. Fig. 6 shows the performance of the two algorithms as the number of nodes increases. The time consumed changes slowly as nodes are added and, for the improved algorithm in particular, hardly changes at all. This is because each data block to be processed already has a rural e-commerce live broadcast task corresponding to it, so there is no need to wait for one task to finish before starting a new one; it also indicates that the current system is able to process all data blocks in parallel at the same time. The results also show that the improved algorithm consumes less time.

Fig. 6 Time consumption ratio of the improved algorithm

The collaborative data of the clusters were replicated to obtain data sets on the order of 1 million records, and these data were then applied to the experiments in turn. To make the experimental comparison clearer, the improved algorithm does not set target items in these experiments. The previous experiment showed that the current data volume achieves the best results with only 4 nodes, so 4 data nodes are used in this experiment; the results are shown in Fig. 7. Fig. 7 shows that the ratio of the time taken by the two algorithms decreases as the data volume increases, indicating that the system performs better on larger data volumes. The system does not perform well when the data volume is small, because some necessary I/O operations, including storage, communication, and coordination management, account for a larger share of the running time.

Fig. 7 Variation of algorithm analysis time with the amount of mined data

6 Conclusion

The live broadcast of e-commerce has entered a phase of rapid expansion as a result of the emergence of numerous live broadcast platforms against the backdrop of new media. Universities should provide a structured training program to teach social skills to new farmers under the direction of government programs. An e-commerce recommendation system can offer customers product information and suggestions based on an accurate identification of their consumption preferences, simulating sales staff to assist customers with the purchase process and preventing information “overload” for the customer.

Webcasting will eventually have real-time interactivity and improved data categories to aid customers in understanding the entire product and enhance the purchasing experience.

High-quality live content should be used to increase consumer stickiness and fan loyalty, and webcasting should be used to publicize the entire process of planting, processing, and transporting agricultural products. This will help win consumers’ recognition of the products.