1 Introduction

As a special type of social networking sites [1,2,3], online dating sites have emerged as popular platforms for single people to seek potential romance. According to a recent survey, nearly 40 million single people (out of 54 million) in the U.S. have been trying online dating, and about 20% of committed relationships began online [4]. Although some psychologists have questioned the reliability and effectiveness of online dating [5], recent empirical studies using the tracking data and survival analysis found that for heterosexual couples, meeting partners through online dating sites can speed up marriage [6]. Besides, one survey found that marriages initiated through online channels are slightly less likely to break than through traditional offline channels and have a slightly higher level of marital satisfaction for the respondents [7].

Mate choice and marital decisions, because of their importance to the formation and evolution of society, have drawn wide attention of scholars from different fields. Two hypotheses, potentials-attract and likes-attract, have been proposed to explain the preference and choice of long-term mates [8]. The potentials-attract means that people choose mates matched with their sex-specific traits indicating reproductive potentials: men pay more attention than women to youthfulness, health, and physical attractiveness of partners which are the characteristics of fertile mates, while women pay more attention than men to ambition, social status, financial wealth, and commitment of partners which are the characteristics of good providers. In other words, men tend to seek young and physically attractive women, while women pay more attention to men’s socio-economic status [9, 10], which is consistent with the Chinese saying “lang cai nv mao” for the choice of long-term partners [8]. In fact, analyzing gender differences of online identity reconstruction in an online social network revealed that men value personal achievements more while women value physical attractiveness more [11]. The likes-attract means that people choose mates who are similar to themselves in a variety of attributes, which is consistent with the Chinese saying “men dang hu dui”. From the perspective of evolutionary and social psychology [12], the difference in parental investment strategies determines the different mate selection strategies for both sexes [13]. Empirical studies on offline dating showed that mate choice is very much in line with the evolutionary predictions of parental investment theory on which potentials-attract hypothesis is founded [14, 15], while one research on a Chinese online dating site showed that mate choice is more consistent with the likes-attract hypothesis [8].

From a sociological perspective, compared with the offline environment, online dating largely expands the search scope of potential mates [16, 17]. The Internet allows users to form relationships with strangers whom they did not know before, whether through online or offline channels. For individuals who have difficulty finding potential partners through offline channels, such as homosexuals and middle-aged and elderly heterosexuals, the Internet provides an ideal platform for meeting partners. Preferences in mate selection have been extensively studied [18,19,20,21], including preferences for education level [22], age [23] and race [24, 25]. The matching pattern, or the choice of potential mates, shows homophily [26, 27], that is, people prefer mates who are similar to themselves. Three possible reasons lead to homophily. First, similar people are more likely to share hobbies and visit the same places, so they are more likely to meet [17]. Second, relationships formed through introductions by friends and relatives also exhibit homophily [28]. Finally, the similarity between partners can also be explained by individual preferences or cost/benefit calculation. By analyzing OkCupid data [21], Lewis found that although there is a similarity preference in partner selection, the preference is not always symmetrical for men and women. On some online dating platforms, users can browse the profiles of other users anonymously, without leaving any trace of the visit. A recent study on a major North American online dating site found that anonymous users viewed more profiles than nonanonymous ones; however, nonanonymous users achieved better matching results [29].

Economists usually study mate choice and marriage from the perspective of game theory and strategic behavior [30,31,32,33,34,35]. Considering the differences in mate choice between the sexes in the marriage market, Becker treated marriage matching as a frictionless matching process and, by constructing a matching model, proved that mate choice is not random but a careful personal selection on attributes [30, 31]; this model was later extended to bargaining matching by Pollak et al. [32]. The marriage market is the first stage of a multi-stage game and corresponds with the Pareto efficiency of equilibrium. In the Internet age, Lee and Niederle ran a two-stage experiment in an online dating market using rose-for-proposal signals [36] and found that sending a preference signal can increase the acceptance rate. Other scholars have also studied mate preference from an economic perspective [37, 38]. For example, Fisman et al. found that male selectivity is invariant to the size of the female group, while female selectivity increases strongly with the size of the male group [37].

Computer scientists usually study online dating from the perspective of user behaviors [39,40,41] and recommendation systems [4, 42,43,44]. By analyzing online dating data, Xia et al. found that there exists distinct difference between preferences of men and women [41], and there also exists difference between users’ stated and actual preferences. Xia et al. also proposed a reciprocal recommendation system for online dating based on similarity measures [4]. For general social networks, gender differences lead to obvious differences in behaviors and preferences between men and women. Research on an online-game society showed that females perform better economically and are less risk-taking than males, and they are also significantly different from males in managing their social networks [45]. Another research found sex-related differences in communication patterns in a large dataset of mobile phone records and showed the existence of temporal homophily [46].

Although research on mate choice, both offline and online, has extended to many fields, the following problems remain: (i) online dating sites are a special kind of social networking site, but most previous studies focus only on users’ demographic attributes and do not consider users’ network centrality in dating sites, which may be an important factor associated with mate selection; (ii) most studies focus on male and female preferences in mate choice, but they do not properly examine the compatibility of the two parties’ preferences; (iii) with the advent of the big data era, machine learning methods such as ensemble learning have been widely applied in diverse fields to achieve good prediction performance, yet most of the existing literature still uses only econometric methods to study users’ mate choice.

To address these research gaps, in this paper we use empirical data from a large online dating site in China to explore users’ attribute preferences compared with random selection, and we use logistic regression to study how users’ demographic attributes, popularity, activity and compatibility scores are associated with messaging behaviors, which reveals gender differences in potential mate selection. We also use ensemble learning classifiers to rank the importance of various potential factors in predicting messaging behaviors. Finally, we use correlation analysis to study users’ strategic behavior.

2 Dataset

This study is based on a complete anonymized dataset extracted in 2011 from a large online dating site in China; it covers heterosexual users only. The dating site provides many features common to other popular online dating platforms: it allows users to set up a profile, browse the profiles of potential mates, be browsed by potential mates, and send and receive messages. Specifically, when a registered member (user) A visits the dating site, the site recommends, at a specific position on his/her homepage, members that he/she may be interested in according to certain rules. At this point, A can only see each member’s avatar (real photo), nickname, location and age. After A enters a member’s homepage, he/she can browse the member’s detailed personal information without leaving a trace of the visit. After that, if A is very interested in a member, he/she can contact the member through the site’s internal messages. The dataset contains three data tables: female profiles, male profiles and user behavior data. There are 548,395 users in total, including 344,552 male users and 203,843 female users. The user profiles include 35 attributes, such as user ID, gender, birthday, education level, mate requirements and so on. The dating site requires registered users to be at least 18 years old at the time of registration, so the minimum user age on the platform is 18.

Each record in the user behavior data is a triple (\(u_{a}\), \(u_{b}\), action), where action takes one of three values: rec, click, or msg. rec means that the dating site recommended user \(u_{b}\) to user \(u_{a}\), click means that \(u_{a}\) clicked \(u_{b}\) for further personal information, and msg means that \(u_{a}\) sent a message to \(u_{b}\). There are 4,151,224 records in total in the user behavior data, and the numbers of rec, click and msg records are 3,978,321, 138,502 and 34,401, respectively.

3 Results

3.1 Attribute preference analysis

3.1.1 Attribute difference distribution

In online dating, there are significant gender differences in terms of attribute preference, self-presentation and interaction [47]. Users usually have a certain preference for mates’ age or height. For both men and women, when they send messages to their potential partners, we compute the age difference as age(receiver) − age(sender), and the height difference as height(receiver) − height(sender). Figures 1 and 2 show the age difference and height difference distributions, respectively. As a comparison, we also show the randomized results by assuming that female(male) users randomly send messages to male(female) users.

Figure 1

Age difference distribution. FM represents that female users send messages to male users and MF represents that male users send messages to female users. Solid lines represent the locally weighted polynomial regression fitting of their corresponding data points, and the gray interval represents a 95% confidence region

Figure 2

Height difference distribution. FM represents that female users send messages to male users and MF represents that male users send messages to female users. Solid lines represent the locally weighted polynomial regression fitting of their corresponding data points, and the gray interval represents a 95% confidence region

In most times and places, women usually marry older men [48, 49]. Figure 1 shows that in modern Chinese society, on average, men prefer women two years younger than themselves and women prefer men two years older than themselves. However, the range of age differences that women accept is smaller than that of men: at the lower bound, women accept men who are 11 years younger than them, and at the upper bound, men who are 23 years older than them, while at the lower bound men accept women who are 25 years younger than them, and at the upper bound, women who are 28 years older than them. Considering only the age difference distributions, and in line with previous findings from a range of cultures and religions [50], we find that the range of ages that women are willing to message is narrower than the range that men are willing to message. Male and female preferences are not random; both seek potential dates with a smaller age difference than predicted by random selection, which shows the characteristic of likes-attract.

Figure 2 shows that generally the height difference for women sending messages to men (most commonly 12 cm) is larger than that for men sending messages to women (most commonly 10 cm) when choosing potential mates. In China, for men, the ideal height difference is that they are 10 cm taller than the person they message, while for women, the ideal height difference is that they are 12 cm shorter than the person they message. According to data from Yahoo! dating personal advertisements, for users in the U.S., height also matters for dating, especially for females [51]. In Fig. 2, the height difference range for women is smaller than that for men: the minimum height women accept is that men are 3 cm shorter than them and the maximum height they accept is that men are 30 cm taller than them, while the minimum height men accept is that women are 13 cm shorter than them and the maximum height they accept is that women are 32 cm taller than them. Females show the characteristic of likes-attract in terms of preference for height. As with age, users seek potential mates with a smaller height difference than predicted by random selection, although the effect is less pronounced than for age difference.

It is noteworthy that on the dating site, users’ characteristics are all self-reported. For impression management reasons [52], users may exaggerate their personal characteristics [53]. For example, a recent study comparing online self-reported height with objectively measured data in young Australian adults revealed that self-reported height is significantly overestimated, by a mean of 1.79 cm for males and 1.29 cm for females [54]. Men lie more than women about their height, which has also been found among online daters in New York City [55]. Users of the dating site also seem not to have reported their physical height accurately. In the dataset, the average heights of female and male users are 161.99 cm (\(\mathit{SD}=4.18\)) and 173.08 cm (\(\mathit{SD}=4.68\)), respectively. However, in the real world the average heights of adult females and males in China are 160.88 cm and 169.00 cm, respectively, which suggests that female and male users may exaggerate their height by an average of 1.11 cm and 4.08 cm, respectively. After correcting for this, the preferred real height differences would be \(10-(4.08-1.11) = 7.03\text{ cm}\) for men and \(12-(4.08-1.11) = 9.03\text{ cm}\) for women. However, we also notice that on the dating site, the average ages of male and female users are 28.73 and 28.58 years, respectively, while in the overall adult population in China, the average ages of men and women are 40.56 and 41.01 years, respectively, according to population census data. The dating population is younger than the overall adult population and thus is likely taller, so users may not exaggerate their height by quite as much as calculated.
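The height correction above is simple arithmetic, and a short sketch may make the steps easier to check. The constants are the averages quoted in the text; the function name `corrected_gap` is hypothetical, and the sketch assumes, as the text does, that the whole gap between site and population averages is exaggeration.

```python
# Average heights quoted in the text, in centimetres.
SITE_FEMALE, SITE_MALE = 161.99, 173.08   # self-reported on the dating site
POP_FEMALE, POP_MALE = 160.88, 169.00     # adult population in China


def corrected_gap(reported_gap_cm):
    """Subtract the net male-minus-female overstatement from a reported gap.

    Assumption: the difference between site and population averages is
    entirely due to exaggeration (the text notes this is an upper bound).
    """
    over_female = SITE_FEMALE - POP_FEMALE   # about 1.11 cm
    over_male = SITE_MALE - POP_MALE         # about 4.08 cm
    return round(reported_gap_cm - (over_male - over_female), 2)


if __name__ == "__main__":
    print(corrected_gap(10))  # preferred gap for men sending messages
    print(corrected_gap(12))  # preferred gap for women sending messages
```

Because the dating population is younger than the general adult population, the true correction is probably smaller than this sketch computes.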

3.1.2 Attribute preference

When a user sends a message to another user, his/her choice of recipient may not be random, but rather has some preference for certain attributes, such as preference for employment, education, income, and so on. To characterize the preference of sender with attribute i for receiver with attribute j, let \(m_{ij}\) be the number of messages sent from users with attribute i to users with attribute j, \(m_{i}\) be the total number of messages sent from users with attribute i, \(n_{j}\) be the number of receivers with attribute j, and n be the total number of receivers, then the attribute preference is \(p_{ij} = m_{ij} /m_{i} - n_{j} /n\). \(p_{ij}>0\) indicates that compared with random selection, senders with attribute i have a preference for receivers with attribute j, \(p_{ij}=0\) indicates that there is no preference and \(p_{ij}<0\) indicates negative preference, i.e. preferring not to select the receivers with attribute j.
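As a minimal sketch of this definition, the following computes \(p_{ij}\) from a list of messages and a list of receivers. The function name and the input layout (one `(sender_attr, receiver_attr)` pair per message, one attribute value per receiver) are assumptions made for illustration and are not taken from the original analysis.

```python
from collections import Counter


def attribute_preference(messages, receivers):
    """Return {(i, j): p_ij} with p_ij = m_ij / m_i - n_j / n.

    messages:  iterable of (sender_attr, receiver_attr), one pair per message.
    receivers: iterable with the attribute value of each receiver
               (assumed layout; it defines n_j and n).
    """
    messages = list(messages)
    receivers = list(receivers)
    m_ij = Counter(messages)                 # messages from attribute i to j
    m_i = Counter(i for i, _ in messages)    # all messages sent from attribute i
    n_j = Counter(receivers)                 # receivers with attribute j
    n = len(receivers)                       # total number of receivers
    return {
        (i, j): m_ij[(i, j)] / m_i[i] - n_j[j] / n
        for i in m_i
        for j in n_j
    }


if __name__ == "__main__":
    msgs = [("a", "x"), ("a", "x"), ("a", "y"), ("b", "y")]
    pool = ["x", "y", "y", "y"]
    for key, value in sorted(attribute_preference(msgs, pool).items()):
        print(key, round(value, 3))
```

For a fixed sender attribute i, the values \(p_{ij}\) sum to zero over j, so a positive preference for one receiver attribute is always offset by negative preferences for others.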

Employment preferences are shown in Figs. 3 and 4 (see Tables 1 and 2 in Additional file 1 for the meanings of attributes and the number and proportion of men/women for each employment). We find that compared with males sending messages to females, when female users send messages to male users, there is a stronger preference for the employments of their potential mates. In Fig. 3, we find that women who are students, accountants, educators or in other uncategorized occupations are not preferred by men, while women engaged in design are slightly popular in terms of the relative amount of messages received, especially for men in aviation service industry. At the same time, we also find that in these data, men engaged in housekeeping only send messages to women in accounting and men engaged in translation industry only send messages to women who are private owners, which may be due to the small sample size of user behavior with respect to these attributes.

Figure 3

Employment preference for male users sending messages to female users. The vertical axis indicates the male occupations and the horizontal axis indicates the female occupations. Preference values are represented by different colors

Figure 4

Employment preference for female users sending messages to male users. The vertical axis indicates the female occupations and the horizontal axis indicates the male occupations. Preference values are represented by different colors

From Fig. 4, we find that the most popular professions for men are senior management, finance, education and private owners. Most people in these four occupations have high income or are well-educated. Unpopular male users are school students, salesmen and those engaged in other uncategorized occupations. At the same time, women engaged in chemical industry tend to seek men engaged in education and training, women engaged in sports tend to seek men who are private owners, and women engaged in police only send messages to men engaged in finance and real estate in these data, which may also be attributed to the small sample size of user behavior with respect to these attributes.

Education levels have a significant impact on mating and marriage [22]. Education level preferences are shown in Figs. 5 and 6 (see Tables 3 and 4 in Additional file 1 for the meanings of attributes and the number and proportion of men/women for each education level). In China, as in other countries, postdoctor refers to a position rather than an educational achievement. However, on many Chinese websites, postdoctor is offered at registration as an education level beyond the PhD. As with employment, we find that compared with males sending messages to females, when female users send messages to male users, there is a stronger preference for the education level of their potential mates. Figure 5 shows that men whose education level is below the undergraduate degree tend to look for women with the same or lower academic qualifications, men with an education level higher than a bachelor’s degree but lower than a doctoral degree tend to look for women with a bachelor’s degree, and men with a PhD degree or postdoctoral training tend to look for women with a graduate degree. In terms of preference for education levels, men generally show the likes-attract characteristic. For female users sending messages to male users, Fig. 6 shows that men with undergraduate and graduate degrees are popular and that, for most women, undergraduate males are more popular, but graduate females are more likely to seek potential mates with a graduate degree. In terms of preference for education levels, women generally show the potentials-attract characteristic. Research on a German online dating site revealed that preference for a similar educational background increases with educational level. Females are reluctant to communicate with males with lower educational levels; however, there are no barriers for males to contact females with lower educational qualifications [22].

Figure 5

Education level preference for male users sending messages to female users. The vertical axis indicates the male education levels and the horizontal axis indicates the female education levels. Preference values are represented by different colors

Figure 6

Education level preference for female users sending messages to male users. The vertical axis indicates the female education levels and the horizontal axis indicates the male education levels. Preference values are represented by different colors. Postdoctoral females did not send any message to men in the dataset, and we set the elements in the corresponding row to 0

Education level and income are two important indicators of a person’s social and economic status. From Figs. 7 and 8 (see Tables 5 and 6 in Additional file 1 for the meanings of attributes and the number and proportion of men/women for each income level) we find that, in terms of income levels, there is less obvious preference on potential mate selection for male users compared with female ones. On the one hand, as shown in Fig. 7, all men obviously prefer women whose monthly income is between RMB 5000 and RMB 10,000 (the RMB is the Chinese currency, and RMB 1 = 0.145 US Dollars = 0.128 Euros), while women whose income is below RMB 2000 are obviously excluded. However, men show no obvious preference or exclusion for women whose income is above RMB 10,000. On the other hand, as shown in Fig. 8, all women dislike men who earn less than RMB 5000, and men who earn RMB 10,000 to RMB 20,000 are the most popular. In terms of preference for income levels, generally women also show potentials-attract characteristic. A field experiment on a Chinese online dating site found that men visited the profiles of women of different incomes with roughly the same rates, while for women, the higher the male incomes are, the greater the rates of visiting their profiles will be [38], which is different from our findings.

Figure 7

Preference for monthly income levels for male users sending messages to female users. The vertical axis indicates the male income levels and the horizontal axis indicates the female income levels. Preference values are represented by different colors

Figure 8

Preference for monthly income levels for female users sending messages to male users. The vertical axis indicates the female income levels and the horizontal axis indicates the male income levels. Preference values are represented by different colors

3.2 Logistic regression classification

3.2.1 Compatibility scores

On their personal homepages, users state their requirements for potential mates, covering 7 attributes, i.e. age, avatar, education level, height, credit rating, place of residence and marital status (see Figs. 1–4 in Additional file 1 for the selection requirements of several attributes). As for credit rating, on the dating site, after a user passes the quick identity authentication, or uploads one of three documents (the ID card, the passport or the Hong Kong and Macau Pass) and passes the review, he/she obtains the first star, i.e. credit rating equals 1. On top of the first star, each time a new document is uploaded and approved, an additional star is added (up to five stars, i.e. five-star member). In addition, although the minimum age of users on the platform is 18, a very small number of users set their required minimum or maximum age below 18 (see Fig. 3 in Additional file 1 for details). We apply the concept of compatibility score to describe the match between users based on whether or not a user meets another user’s selection requirements. When women send messages to men, over all messages and for each attribute, we can obtain the proportion of women who match the mate preferences of men and the proportion of men who meet the preferences of women, i.e. we get two vectors of 7 proportions each. According to the data we obtain \(\mathbf{w}_{\mathrm{FMm}}= (0.701,0.886,0.462,0.826,0.919,0.786,0.920)\), and \(\mathbf{w}_{\mathrm{FMf}}=(0.912,0.976,0.681,0.962,0.994,0.864,0.912)\), where \(\mathbf{w}_{\mathrm{FMm}}\) contains the proportions of female attributes meeting male preferences and \(\mathbf{w}_{\mathrm{FMf}}\) contains the proportions of male attributes consistent with female preferences. Similarly, when men send messages to women, we obtain \(\mathbf{w}_{\mathrm{MFm}}=(0.877,0.977,0.402,0.980,0.992,0.831,0.960)\) and \(\mathbf{w}_{\mathrm{MFf}}=(0.671,0.867,0.572,0.678,0.758,0.771,0.892)\).
Thus the compatibility scores of women sending messages to men are

$$\begin{aligned}& c_{\mathrm{FMm}} = \frac{\mathbf{w}_{\mathrm{FMm}} \cdot { (\textrm{female attr. in male pref.})}}{ {\operatorname{sum}(\mathbf{w}_{\mathrm{FMm}} )}}, \end{aligned}$$
(1)
$$\begin{aligned}& c_{\mathrm{FMf}} = \frac{\mathbf{w}_{\mathrm{FMf}} \cdot (\textrm{male attr. in female pref.})}{ {\operatorname{sum}(\mathbf{w}_{\mathrm{FMf}} )}}, \end{aligned}$$
(2)

and the compatibility scores of men sending messages to women are

$$\begin{aligned}& c_{\mathrm{MFm}} = \frac{\mathbf{w}_{\mathrm{MFm}} \cdot (\textrm{female attr. in male pref.})}{ {\operatorname{sum}(\mathbf{w}_{\mathrm{MFm}} )}}, \end{aligned}$$
(3)
$$\begin{aligned}& c_{\mathrm{MFf}} = \frac{\mathbf{w}_{\mathrm{MFf}} \cdot (\textrm{male attr. in female pref.})}{ {\operatorname{sum}(\mathbf{w}_{\mathrm{MFf}} )}}, \end{aligned}$$
(4)

where (female attr. in male pref.) is a vector characterizing whether female attributes meet male preferences for a pair of users (1 for yes and 0 for no), and similarly (male attr. in female pref.) is a vector characterizing whether male attributes meet female preferences for a pair of users. Equations 1 and 3 are the compatibility scores between a male preference and the profile of his chosen mate, and Eqs. 2 and 4 are the compatibility scores between a female preference and the profile of her chosen mate. For a pair of users, \(u_{a}\) and \(u_{b}\), we use a score, i.e. reciprocal score, to quantify how much the attributes of \(u_{b}\) match the preferences of \(u_{a}\) and how much the attributes of \(u_{a}\) match the preferences of \(u_{b}\). The reciprocal score between \(u_{a}\) and \(u_{b}\) is the mean of the compatibility scores of these two users, that is, for women sending messages to men the reciprocal score is \(\mathit{rs} = (c_{\mathrm {FMm}} + c_{\mathrm{FMf}} )/2\), and for men sending messages to women \(\mathit{rs} = (c_{\mathrm{MFm}} + c_{\mathrm{MFf}} )/2\).
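Equations 1 and 2 and the reciprocal score can be sketched as follows for a woman sending a message to a man. The weight vectors are those reported above; the function names are hypothetical, and the sketch assumes the components follow the order in which the seven attributes are listed (age, avatar, education level, height, credit rating, place of residence, marital status).

```python
# Weights reported in the text for women sending messages to men.
W_FMM = (0.701, 0.886, 0.462, 0.826, 0.919, 0.786, 0.920)  # female attr. in male pref.
W_FMF = (0.912, 0.976, 0.681, 0.962, 0.994, 0.864, 0.912)  # male attr. in female pref.


def compatibility(weights, meets):
    """Weighted share of requirements met: (w . meets) / sum(w).

    meets: sequence of 0/1 flags, one per attribute, in the same order
    as the weights (assumed order: age, avatar, education, height,
    credit rating, residence, marital status).
    """
    if len(weights) != len(meets):
        raise ValueError("weights and meets must have the same length")
    return sum(w * m for w, m in zip(weights, meets)) / sum(weights)


def reciprocal_score(c_m, c_f):
    """Mean of the two directional compatibility scores."""
    return (c_m + c_f) / 2


if __name__ == "__main__":
    # Hypothetical pair: she misses only his education requirement,
    # he satisfies all of her requirements.
    c_fmm = compatibility(W_FMM, [1, 1, 0, 1, 1, 1, 1])
    c_fmf = compatibility(W_FMF, [1, 1, 1, 1, 1, 1, 1])
    print(round(c_fmm, 3), round(c_fmf, 3), round(reciprocal_score(c_fmm, c_fmf), 3))
```

Normalizing by the sum of the weights keeps each compatibility score in [0, 1], so the reciprocal score is also bounded by 0 and 1.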

3.2.2 Logistic regression

Let click be the number of times a user is clicked, msg be the number of messages received by a user, and rec be the number of times a user is recommended and shown on other users’ homepages. We define \(\mathit{pop}_{1} = \mathit{click}/\mathit{rec}\) and \(\mathit{pop}_{2} = \mathit{msg}/\mathit{rec}\), which characterize the popularity of a user based on actions. We also use PageRank centrality (\(\mathit{pop}_{3}\)) to quantify how focal or popular a user is in the network by considering all connections in the network. Attractive people, such as those with advantageous demographic attributes and higher socio-economic status, tend to be more demanding than average people in their choice of potential mates, as revealed by the preference analysis of income and education level in Sect. 3.1.2. Those who are perceived as attractive by attractive people can be even more popular/attractive. The variables used in the paper and their meanings are shown in Table 1.
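The two action-based popularity indices can be sketched directly from the behavior triples described in Sect. 2. The function name and the decision to skip users who were never recommended (for whom the ratios are undefined) are assumptions of this sketch; \(\mathit{pop}_{3}\) requires a graph computation and is not covered here.

```python
from collections import Counter


def action_popularity(records):
    """Return {user: (pop1, pop2)} with pop1 = click/rec and pop2 = msg/rec.

    records: iterable of (u_a, u_b, action) triples, where action is
    "rec", "click" or "msg" and counts are attributed to the target u_b.
    Users with rec == 0 are omitted (assumption: ratio undefined).
    """
    counts = {"rec": Counter(), "click": Counter(), "msg": Counter()}
    for _u_a, u_b, action in records:
        if action not in counts:
            raise ValueError(f"unknown action: {action!r}")
        counts[action][u_b] += 1
    return {
        user: (counts["click"][user] / n_rec, counts["msg"][user] / n_rec)
        for user, n_rec in counts["rec"].items()
    }


if __name__ == "__main__":
    demo = [
        ("a", "b", "rec"), ("c", "b", "rec"),
        ("a", "b", "click"), ("a", "b", "msg"),
        ("a", "d", "rec"),
    ]
    print(action_popularity(demo))
```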

Table 1 Variables and their corresponding meanings

We introduce several centrality indices, such as \(\mathit{pop}_{1}\), \(\mathit{pop}_{2}\), \(\mathit{pop}_{3}\), and indegree, to evaluate their correlation with messaging behaviors. It is noteworthy that the centrality indices are aggregated indicators describing users’ desirability or popularity, and users do not know their indices, nor do they know the indices of others. We use outdegree to characterize users’ activity level, and in the dating site, users also do not know the outdegree of other users. In reality, instead of using the indices to identify or select attractive partners, users will message another based on more specific clues, such as higher income, better education background, attractive photos or good demographic and socio-economic compatibility. In the paper, we will evaluate whether the indices are significantly associated with messaging behaviors.

Suppose \(p_{i}\) is the probability of sending messages for a female user i and \(1-p_{i}\) is the probability of not sending messages; then \(L_{f_{i}}=\ln(\frac{p_{i}}{1-p_{i}})\), i.e., for all women, \(L_{f}=\ln(\frac{p}{1-p})\). Similarly, suppose \(q_{j}\) is the probability of sending messages for a male user j and \(1-q_{j}\) is the probability of not sending messages; then \(L_{m_{j}}=\ln (\frac{q_{j}}{1-q_{j}})\), i.e., for all males, \(L_{m}= \ln(\frac{q}{1-q})\). We obtain logistic regression models as follows:

$$\begin{aligned}& L_{f} = \alpha _{1} + {\boldsymbol{\beta} }_{1} \cdot {\mathbf{attribute}} + \varepsilon _{\mathrm{1}}, \end{aligned}$$
(5)
$$\begin{aligned}& L_{m} = \alpha _{2} + {\boldsymbol{\beta }}_{2} \cdot {\mathbf{attribute}} + \varepsilon _{\mathrm{2}}. \end{aligned}$$
(6)

In this study, multicollinearity tests are conducted so that only independent variables whose pairwise correlation coefficients are less than 0.5 are retained (see Tables 7 and 8 in Additional file 1 for details). The logistic regression results for women sending messages to men are shown in Table 2. We find that almost all the variables are significant when only considering the attributes of women (model 1), i.e., the attributes of senders, but only housing and outdegree of women are positively associated with the probability of women sending messages to men. When only considering the male attributes (model 2), all variables except male mobile phone verification and credit rating are significant and positively associated with the probability of women sending messages. When considering the two parties’ attributes and compatibility scores (model 3), among the significant variables, female mobile phone verification, car ownership, credit rating and popularity levels (\(\mathit{pop}_{1}\) and \(\mathit{pop}_{3}\)) are negatively associated with the probability of women sending messages, while the other variables are positively associated. We find that, when women send messages to men, they are concerned not only about whether they meet the requirements of men but also about whether men meet their own requirements.

Table 2 Logistic regression results for female users sending messages to male users

The logistic regression results for men sending messages to women are shown in Table 3. We find that when only the female attributes are considered (model 1), all variables except female mobile phone verification, credit rating and outdegree are significant, but only female house ownership is negatively associated with the probability of male messaging. When only male attributes are considered (model 2), all the variables are significant, but only male outdegree is positively correlated with messaging behaviors, while the others are negatively correlated. With all variables considered (model 3), all variables are significant except female credit rating, outdegree, and the compatibility score between a female preference and the profile of the corresponding other side. Among the significant variables, female mobile phone verification, car ownership, popularity (\(\mathit{pop}_{1}\), \(\mathit{pop}_{2}\) and \(\mathit{pop}_{3}\)), male outdegree and the compatibility score between a male preference and the profile of the corresponding other side are positively correlated with messaging behaviors, while all the other variables are negatively correlated. In addition, by analyzing the significance of the two compatibility scores, we find that men only pay attention to whether women meet their own requirements when sending messages to women.

Table 3 Logistic regression results for male users sending messages to female users

As can be seen from Tables 2 and 3, for males or females sending messages, popularity of the other side is significantly positively associated with messaging behaviors. On the one hand, \(\mathit{pop}_{1}\) and \(\mathit{pop}_{2}\) values, according to their calculation method, represent a user’s local popularity. On the other hand, \(\mathit{pop}_{3}\) value, i.e. PageRank, represents the popularity of a user from a global perspective.
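To make the global measure concrete, the following is a minimal sketch of PageRank computed by power iteration on a directed messaging graph. The edge direction (sender to receiver), the damping factor of 0.85, the function name `pagerank` and the toy graph are all illustrative assumptions; the text only states that \(\mathit{pop}_{3}\) is PageRank and does not give these details.

```python
def pagerank(edges, damping=0.85, iterations=100):
    """Power-iteration PageRank on a directed graph given as (sender, receiver) pairs.

    Assumptions (not from the paper): damping = 0.85, edges point from sender
    to receiver, and rank held by users with no outgoing messages is spread
    evenly over all users.
    """
    nodes = sorted({user for edge in edges for user in edge})
    out_links = {user: [] for user in nodes}
    for sender, receiver in edges:
        out_links[sender].append(receiver)

    n = len(nodes)
    rank = {user: 1.0 / n for user in nodes}
    for _ in range(iterations):
        # Total rank currently held by users who sent no messages.
        dangling = sum(rank[u] for u in nodes if not out_links[u])
        new_rank = {u: (1.0 - damping) / n + damping * dangling / n for u in nodes}
        for u in nodes:
            if out_links[u]:
                share = damping * rank[u] / len(out_links[u])
                for v in out_links[u]:
                    new_rank[v] += share
        rank = new_rank
    return rank


# Hypothetical toy messaging graph: each pair is (sender, receiver).
toy_edges = [("w1", "m1"), ("w2", "m1"), ("w3", "m1"), ("w3", "m2"), ("m1", "w1")]
scores = pagerank(toy_edges)
most_popular = max(scores, key=scores.get)
```

In this toy graph the user who receives the most messages, and who is contacted by a user who is contacted in turn, ends up with the highest score, which is the sense in which PageRank reflects popularity beyond a user's immediate neighbors.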

For females sending messages to males, the odds ratio \(\exp (0.390) = 1.477\) for male \(\mathit{pop}_{1}\) is larger than \(\exp (0.146) = 1.157\) for male \(\mathit{pop}_{3}\), and for males sending messages to females, \(\exp (0.462) = 1.587\) for female \(\mathit{pop}_{1}\) is also larger than \(\exp (0.141) = 1.151\) for female \(\mathit{pop}_{3}\). Thus, for both males and females, the other party’s \(\mathit{pop}_{1}\) is more strongly associated with messaging than \(\mathit{pop}_{3}\). Comparing across genders, \(\exp (0.390) = 1.477\) for male \(\mathit{pop}_{1}\) when females send messages to males is smaller than \(\exp (0.462) = 1.587\) for female \(\mathit{pop}_{1}\) when males send messages to females, which indicates that the other side’s \(\mathit{pop}_{1}\) is more strongly associated with the messaging behaviors of males than with those of females. Conversely, \(\exp (0.146) = 1.157\) for male \(\mathit{pop}_{3}\) when females send messages to males is larger than \(\exp (0.141) = 1.151\) for female \(\mathit{pop}_{3}\) when males send messages to females, which indicates that the other side’s \(\mathit{pop}_{3}\) is more strongly associated with the messaging behaviors of females than with those of males.

In China, having an apartment and a car is a symbol of a person’s wealth and social status, and in some regions they have become necessities for getting married. When women send messages to men, it is important for men to have a house and a car. When men send messages to women, it is not important for women to have a house, but it is somewhat important for them to have a car. We find that \(\exp(0.038) = 1.039\) for the other side’s car ownership when men send messages to women is smaller than \(\exp (0.157) = 1.170\) for the other side’s car ownership when women send messages to men, indicating that women pay more attention than men to whether the other side has a car.
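The odds ratios quoted in the two preceding paragraphs follow from exponentiating the logistic regression coefficients. The short sketch below reproduces that arithmetic; the coefficients are the ones given in the text, while the dictionary layout and the helper name `odds_ratio` are illustrative choices.

```python
import math


def odds_ratio(coefficient):
    """Multiplicative change in the odds of messaging per one-unit increase in a predictor."""
    return math.exp(coefficient)


# Coefficients as quoted in the text; the keys are labels chosen for this sketch.
coefficients = {
    ("women -> men", "male pop_1"): 0.390,
    ("women -> men", "male pop_3"): 0.146,
    ("women -> men", "male car ownership"): 0.157,
    ("men -> women", "female pop_1"): 0.462,
    ("men -> women", "female pop_3"): 0.141,
    ("men -> women", "female car ownership"): 0.038,
}

odds_ratios = {key: odds_ratio(beta) for key, beta in coefficients.items()}
for (direction, variable), ratio in odds_ratios.items():
    print(f"{direction:13s} {variable:22s} {ratio:.3f}")
```

An odds ratio above 1 means the odds of sending a message rise with the predictor. Comparing magnitudes across predictors in this way presumes the predictors are on comparable scales.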

A user’s outdegree quantifies the user’s activity. On the surface, high activity means contacting many other users; more fundamentally, it may imply that users invest more time and resources in attempting to find potential partners. The role of outdegree differs between men and women. When a woman sends a message to a man, the other side’s outdegree is significantly positively associated with the messaging behavior, whereas this is not the case when a man sends a message to a woman. When women send messages to men, network measures of popularity and activity of the men they contact are significantly positively associated with their messaging behaviors, but when men send messages to women, only the network measures of popularity of the women they contact are significantly positively associated with their messaging behaviors.

3.3 Ensemble learning classification

With the advent of the big data era, ensemble learning classification methods have gradually been introduced into the field of social network research. As early as 1996, Breiman proposed the method of bagging [56], and five years later he further proposed the method of Random Forest [57]. Freund proposed the AdaBoost method in 1997 [58]. With the continuous improvement of machine learning classifiers, Chen et al. proposed the XGBoost classifier in 2016 [59], which can greatly improve the efficiency and accuracy of the algorithm in some cases. As an application, Reece et al. recently applied machine learning tools to identify depression from Instagram photos [60].

Regression analysis often places certain requirements on the independent variables, such as the absence of multicollinearity, whereas ensemble learning classification methods relax these constraints. In this section, ensemble learning classification methods including bagging, Random Forest, AdaBoost and XGBoost are used to evaluate the importance of each attribute in Table 1. We use the package ‘adabag’ in R to perform the AdaBoost and bagging methods, the package ‘randomForest’ to perform the Random Forest method and the package ‘xgboost’ to perform the XGBoost method. For the dataset, 5-fold cross validation is used to assess the classifiers’ performance, and the algorithm parameters are chosen to obtain a stable error rate. The numbers of sending and not sending messages are unbalanced in the dataset, so the larger set is randomly subsampled to obtain a set of the same size as the smaller one.
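The data preparation described above (random subsampling of the larger class, followed by 5-fold cross validation) can be sketched as follows. This is a plain-Python illustration rather than the R workflow used in the study; the record format, the function names and the toy class sizes are assumptions made for the example.

```python
import random


def balance_by_undersampling(records, label_of, seed=0):
    """Randomly subsample the larger class down to the size of the smaller one."""
    rng = random.Random(seed)
    positives = [r for r in records if label_of(r) == 1]
    negatives = [r for r in records if label_of(r) == 0]
    size = min(len(positives), len(negatives))
    balanced = rng.sample(positives, size) + rng.sample(negatives, size)
    rng.shuffle(balanced)
    return balanced


def k_fold_splits(records, k=5):
    """Yield (train, test) pairs; every record appears in exactly one test fold."""
    folds = [records[i::k] for i in range(k)]
    for i in range(k):
        train = [r for j, fold in enumerate(folds) if j != i for r in fold]
        yield train, folds[i]


# Hypothetical data: (pair_id, label), with label 1 = message sent, 0 = not sent.
pairs = [(i, 1) for i in range(30)] + [(i, 0) for i in range(30, 230)]
balanced = balance_by_undersampling(pairs, label_of=lambda r: r[1])
splits = list(k_fold_splits(balanced, k=5))
```

A classifier would be fitted on each `train` part and its error rate measured on the matching `test` part; averaging the five error rates gives the cross-validated estimate.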

The error rates of four ensemble learning classification methods are shown in Table 4. We find that the error rates of Random Forest and AdaBoost are the lowest for females sending messages to males while XGBoost is the lowest for males sending messages to females. Attribute importance ranking is shown in Figs. 9 and 10. Figure 9 shows that when women send messages to men, the three most important attributes are the \(\mathit{pop}_{3}\) and \(\mathit{pop}_{1}\) values for men, and the outdegree for women. Similarly, Fig. 10 shows that when men send messages to women, the three most important attributes are the \(\mathit{pop}_{3}\) and \(\mathit{pop}_{1}\) values for women, and the outdegree for men. The most important factors predicting the decision of sending messages of both men and women are the \(\mathit{pop}_{3}\) and \(\mathit{pop}_{1}\) values representing the popularity of potential mates, which are also significantly positively associated with messaging behaviors in the logistic regression.

Figure 9

Attribute relative importance rankings when women send messages to men for different classification methods. The horizontal axis indicates the attributes and the vertical axis indicates their corresponding importance. For bagging, Random Forest, and AdaBoost, the relative importance of each variable in the classification task is measured by the Gini index, and for XGBoost the relative importance is measured by the Gain parameter

Figure 10

Attribute relative importance rankings when men send messages to women for different classification methods. The horizontal axis indicates the attributes and the vertical axis indicates their corresponding importance. For bagging, Random Forest, and AdaBoost, the relative importance of each variable in the classification task is measured by the Gini index, and for XGBoost the relative importance is measured by the Gain parameter

Table 4 Error rates using ensemble learning classification methods

The purpose of ensemble learning classification differs from that of logistic regression analysis. According to Figs. 9 and 10, the centrality indices are overwhelmingly important, while the other variables have relatively little predictive power. However, this does not mean that the other variables are useless: they can still be significantly associated with users’ messaging behaviors in logistic regression.

3.4 Strategic behavior analysis

The concept of strategic behavior [61] derives from economics, where it originally means that firms take actions that affect the market environment in order to increase profits. The concept was later extended to matching problems [35], such as mate matching. In this study, the counterpart of profit is the message response rate.

In our research, strategic behavior means that whether a user sends a message to another user depends on whether doing so may increase the probability of receiving a reply. Since we have no user response data, we use the centrality indices characterizing user popularity to analyze whether users tend to send messages to people who are more popular than themselves or to those who are less popular. We study the users’ strategic behavior by analyzing the correlation between the centrality indices of senders and receivers. Smoothed curves fitted with a generalized additive model show that the relationship between users’ centrality indices is nonlinear or approximately linear (see Figs. 5 and 6 in Additional file 1 for details); we therefore use the Spearman correlation coefficient to characterize the correlation. As shown in Tables 5 and 6, men and women on the dating site show different messaging patterns despite the reduced cost of rejection in the online environment. For males sending messages to females, there are weak positive correlations between centrality indices, reflected in small but significant positive correlation coefficients, while for females sending messages to males, the positive correlations are weak to modest, reflected in small or slightly larger significant positive coefficients. Men largely do not show strategic behavior when sending messages, while for women, as their centrality indices increase, the corresponding indices of the men who receive their messages also tend to increase.

Table 5 Spearman correlation coefficients among centrality indices when females send messages to males
Table 6 Spearman correlation coefficients among centrality indices when males send messages to females
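Because the Spearman coefficient depends only on ranks, it captures monotone relationships that are not linear, which is why it suits the centrality indices here. The sketch below computes it as the Pearson correlation of rank-transformed values, with tied values given their average rank. The sender and receiver values are hypothetical and only illustrate the calculation.

```python
def average_ranks(values):
    """Rank values from 1 to n, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2.0 + 1.0
        for position in range(start, end + 1):
            ranks[order[position]] = mean_rank
        start = end + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman's rho: Pearson correlation of the rank-transformed values."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical centrality values, one (sender, receiver) pair per position.
sender_index = [0.1, 0.4, 0.2, 0.9, 0.5]
receiver_index = [1.0, 8.0, 3.0, 200.0, 20.0]
rho = spearman(sender_index, receiver_index)
```

In this toy example the receiver values grow monotonically but far from linearly with the sender values, so the rank-based coefficient equals 1 even though a linear correlation would be lower.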

By studying the correlations between the same centrality index pairs for users, we further analyze whether users tend to send messages to people who are more popular than themselves or to those who are less popular. For each centrality index of senders, we give the mean and standard deviation of the corresponding receivers’ indices, and the proportion of the receivers’ centrality indices that are larger than those of the senders’ in Figs. 7 and 8 in Additional file 1. For each centrality index, Table 7 presents the proportion of the receivers’ centrality indices that are larger than those of the senders’ when sending messages. As a comparison, we also give the randomized results. Compared with men, more women tend to send messages to people who are more popular than themselves.

Table 7 The proportions of the receivers’ centrality indices that are larger than those of the senders’ when sending messages
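The proportion summarized in Table 7 and its randomized counterpart can be sketched as below. The paper does not specify how the randomized results were produced, so shuffling receivers across senders is an assumption of this example, as are the function names and the toy values.

```python
import random


def share_receiver_higher(pairs):
    """Fraction of (sender_index, receiver_index) pairs with receiver > sender."""
    return sum(1 for s, r in pairs if r > s) / len(pairs)


def randomized_share(pairs, seed=0, repeats=200):
    """Baseline (assumed scheme): shuffle receivers across senders and average the share."""
    rng = random.Random(seed)
    senders = [s for s, _ in pairs]
    receivers = [r for _, r in pairs]
    total = 0.0
    for _ in range(repeats):
        rng.shuffle(receivers)
        total += share_receiver_higher(list(zip(senders, receivers)))
    return total / repeats


# Hypothetical centrality values for five messages: (sender, receiver).
messages = [(0.2, 0.5), (0.3, 0.9), (0.6, 0.4), (0.1, 0.7), (0.8, 0.9)]
observed = share_receiver_higher(messages)
baseline = randomized_share(messages)
```

An observed share well above the randomized baseline would suggest that senders target more popular receivers more often than chance pairing would produce.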

There have been several studies on users’ strategic behavior in online dating. Some studies have found a significant positive correlation between the popularity of male and female users. For example, the research by Taylor et al. on users from the U.S. showed that they tend to select and be selected by other users whose relative popularity is similar to their own, although this does not necessarily mean a higher success rate, i.e. receiving more responses [62]. A recent empirical analysis of users in four U.S. cities from an online dating site used PageRank to characterize their desirability, and found that both men and women sent messages to partners who are on average about 25% more desirable than themselves [63]. However, there are also some studies that have not found a correlation between users’ popularity. For example, the research on users in Boston and San Diego did not find evidence of strategic behavior [33, 34]. Another study of online dating data from a midsized southwestern city in the U.S. revealed that both men and women tend to send messages to the most socially desirable users, regardless of their own desirability levels, which characterize users’ physical attractiveness, popularity, personableness, and material resources [20]. We find that users on different platforms or in different cultural contexts have different strategic behaviors, and the underlying mechanisms still need to be explored further.

4 Conclusion

In summary, we analyze online dating data to reveal the differences in choice preference between men and women and the important factors affecting potential mate choice. We find that, with compatibility scores considered, when women send messages to men, they pay attention not only to whether men’s attributes meet their own requirements for mate selection, but also to whether their own attributes meet the requirements of men, while when men send messages to women, they only pay attention to whether women’s attributes meet their own requirements. When considering centrality indices, we find that for women, the popularity and activity of the men they contact are significantly positively associated with their messaging behaviors, while for men only the popularity of the women they contact is significantly positively associated with their messaging behaviors. At the same time, we also find that compared with men, women attach greater importance to the socio-economic status of potential partners, and their own socio-economic status affects their enthusiasm for interaction with potential mates. The machine learning classification methods are used to find the important factors predicting messaging behaviors. Finally, strategic behavior is analyzed, and we find that men and women behave differently. Although users do not know the centrality indices of themselves and their potential partners, when women send messages there is a stronger positive correlation between the centrality indices of senders and receivers than when men send messages, and more women are inclined to send messages to people more popular than themselves.

This paper provides a foundation for understanding gender-specific preferences in potential mate choice in online dating. On the one hand, this study can provide references for online dating sites to design better recommendation systems. On the other hand, an in-depth understanding of mate preference, such as the compatibility scores, can help users to select the most appropriate and reliable mates.

The paper still has some limitations. Firstly, we lack the avatar or photo information and the body type data, and thus cannot evaluate the influence of users’ physical attractiveness and body mass index (BMI) on messaging behaviors [33, 34, 64, 65]. In fact, BMI can compensate for the disadvantages of wages or education [65]. Secondly, we only have the message sending data and lack the reply data, which makes it impossible for us to study the interaction between users. Thirdly, the lists of potential partners presented to users are generated by the recommendation algorithm of the website rather than by users’ own searches, and therefore may not reflect users’ preferences well. Ranking effects caused by recommendation algorithms in online environments have been shown to influence the music people select [66] and the politicians people favor [67]. Fourthly, we study the users’ preference for each attribute without considering the potential impact of other attributes. In real life, sending a message to another user is usually not driven by a single attribute. The additional attributes included in users’ profiles (their avatar, place of residence, and marital status) could also influence whether a message is sent, which means that the users’ apparent preference for an attribute can be an illusion and may be based on other considerations. Fifthly, there are significant differences between Chinese and western cultures, and the website is only for heterosexual users; thus the conclusions of this paper may not be applicable to western society or homosexual people [68, 69]. Finally, people’s preferences for certain attributes in potential partners can change over time [70], whereas we only study users’ preferences in mate choice at a particular time.

There are several avenues for future research. We can examine the influence of recommendation algorithms on potential mate choice in online dating. We can also use the results obtained in the paper to further study the problem of stable matching for potential mate choice. Moreover, by combining game theory with real online dating data, we can further understand the users’ behaviors.