1 Introduction

The phenomenon of human migration has been a constant of human history, from the earliest ages until now. As such, the study of migration spans various research fields, including anthropology, sociology, economics, statistics and, more recently, physics and computer science. We are at a moment where various types of data not typically used to study migration are becoming increasingly available. These include so-called social big data: digital traces of humans generated by using mobile phones, online services, online social networks (OSNs) and devices within the internet of things. At the same time, new technologies are able to extract valuable information from these large data sets. Both traditional and novel models and data are currently being employed to understand different questions on migration, including monitoring migration flows and the economic and cultural effects on the migrants and also on the source and destination communities. In this paper, we provide a survey of existing approaches, both traditional and data-rich, and we propose new methods and data sets that could contribute significantly to the study of human migration. We concentrate on three different phases of migration: the journey (analysing migration flows and stocks); the stay (studying migrant integration and changes in the communities involved); and the return (the study of migrants returning to the origin country).

1.1 The journey

At the moment, information about migration flows and stocks comes from official statistics obtained either from national censuses or from the population registries. Given that migration intrinsically involves various nations, data are often inconsistent across databases and offer poor time resolution. With the availability of social big data, we believe it should be possible to estimate flows and stocks from available data in real time, by building models that map observed measures extracted from these unconventional data sources to official data, i.e. now-casting stocks and flows. We also look at migration phenomena within smaller communities, such as scientific migration, where even prediction of migration events can be possible. An important step in understanding migration flows is suitable visualization, which we also explore.

1.2 The stay

Migration might generate cultural changes with both long- and short-term effects on the local and incoming population. Migrant integration is generally measured through indicators related to the labour market, economic status or social ties. Again, these statistics are available with low resolution and not for all countries. A new direction is that of observing integration and perception on migration through big data. For instance, OSN sentiment analysis specific to immigration topics can allow us to evaluate perception of immigration. Analysis of retail data can enable us to understand whether immigrants are integrated economically but also whether they change their habits during their stay. Scientific data can help us understand how migration benefits both the host countries and the migrants themselves. Through these data, we can derive novel integration indices that take into account the traces of human activity observed.

1.3 The return

Besides effects on the receiving communities, the source communities may also see effects of migration. In fact, migrants can maintain a strong attachment to their home countries and eventually return there. This can bring multiple benefits: economic growth, new skills, entrepreneurship, better healthcare, different participation in governance issues and many others. We discuss various approaches to analysing these cases based on existing data.

Both traditional and new methods to analyse migration depend highly on the availability of data. Hence, infrastructures that can catalogue the various data sets and make them available to the community, ensuring privacy and ethical use, are very useful. At the same time, with new methods being developed, means of facilitating their use by the research community are necessary. An example of a framework that aims to meet these requirements is the SoBigData infrastructure [78] (www.sobigdata.eu). This includes a catalogue of methods, data sets and training material, grouped in so-called exploratories. Virtual research environments allow users to use some of the data and methods directly in the SoBigData engine. The exploratory on migration studies includes many of the methods and data sets presented below.

The rest of the paper is organised as follows: The study of migration flows and stocks is discussed in Sect. 2. This compares traditional data (Sect. 2.1) with social big data (Sect. 2.2) including scientific migration (Sect. 2.2.1), providing also a review of tools for visualization of migration data (Sect. 2.3). Section 3 concentrates on migrant integration and perception of migration. We start by looking at approaches based on traditional data sources (Sect. 3.1) and move on to social big data including retail data (Sect. 3.2.1), mobile data (Sect. 3.2.2), language and sentiment in OSNs (Sects. 3.2.3 and 3.2.4), ego networks (Sect. 3.2.5). The return of migrants is discussed in Sect. 4, while Sect. 5 concludes the paper with a summary and a discussion on ethical issues.

2 The journey: migration flows and stocks

In this section, we discuss various means of analysing migration flows and stocks. We start with traditional approaches and data types and then move to new data sets that can be employed for the task, underlining advantages and disadvantages of each approach.

2.1 Traditional data sources and challenges

Tracking international migrants’ flows and stocks is an important but challenging task. At the moment, many researchers and policy makers rely on traditional data sources to study the journey of migrants. Such data sources come from either official statistics or administrative data. Studying the journey of migrants with these traditional data sources, however, comes with various limitations, as migration intrinsically involves various nations. For instance, the data are often inconsistent across databases as different countries employ various definitions of a migrant. Considerable efforts have been made by both researchers and international organizations to improve the quality of traditional data sources and to harmonize them [50, 148, 171]. International organizations such as the United Nations also provide guidelines and suggestions (Footnote 1) which countries should employ when dealing with migration statistics. In this section, each type of data source is described in detail and evaluated.

Census data and surveys are official statistics collected by institutions. They provide socio-demographic information of the population, including immigrants. However, the two types of data have a different focus. Census data are collected once every five or ten years, depending on the country. For example, the most recent data available in the USA are the 2010 census data, while in Europe the last census was performed in 2011. The United Nations recommends (Footnote 2) that countries collect the data in every year that ends with zero, in order to establish consistency across different migration data sets. But as the process of collecting data is expensive and time-consuming, some developing countries do not collect the data as recommended, creating inconsistency across different countries’ databases. The high cost is due to the fact that the majority of countries carry out door-to-door or phone interviews with a randomly selected sample of the population to collect the data. For instance, the Chinese population is almost 1.4 billion (Footnote 3), so about 6 million enumerators are needed to conduct all the interviews. On the other hand, most European countries retrieve the data from administrative registries, which makes the procedure faster [62, 149].

In the census data, the migration-related information collected is the following: citizenship, country of birth, last place of residence and length of stay. However, depending on the characteristics of their immigrants and on their immigration system (Footnote 4), countries do not use the same information to count the number of immigrants. In Europe, for example, different migrant groups are distinguished depending on whether they are from European Member States or from third countries (Footnote 5). On the other hand, the United States counts everyone born outside of its territory as an immigrant. Yet, the recommendation of the United Nations defines an international migrant as “a person who moves to a country other than that of his or her usual residence for a period of at least a year”. The difference in the definition of immigrants makes different migration data incomparable. Furthermore, information about returning migrants is not well captured through the census data. This is due to the fact that returning migrants are not obliged to declare their departure. In the data of the country they leave, they would simply disappear, meaning that information about these migrants is difficult to track.

Census data are usually published in aggregated form by the authorities that organized the census. Typically, immigration rates are made available at country or at most regional level. For instance, historical immigration data can be found on the websites of Eurostat [63], the WorldBank [165], OECD [164] and other local authorities and research institutions [61, 67, 95, 96, 97]. However, in certain situations having data with higher spatial resolution can be useful. Recently, the Joint Research Centre of the European Union published a data challenge (Footnote 6) where they make available for research high-resolution immigration data from the 2011 census, for selected European countries. However, similar data are more difficult to obtain for other regions.

Surveys also collect information about the flows and stocks of immigrants, and they are conducted more often than censuses. Unlike censuses, they are generally designed to collect information on households, the labour market or the community, depending on their main purpose. As a result, they contain very few questions related to migration. For instance, the employment survey in France has only two such questions, on country of origin and date of arrival. With these two details, it is difficult to infer the immigrants’ journey, since a clear definition of immigrants cannot be established. As a consequence, surveys capture immigrants’ flows and stocks with low accuracy, and real-time observation is not possible. In addition, information retrieved from surveys refers to a small subset of the entire population.

Administrative data are retrieved from registries. They can come from health insurance, residence permits, labour permits or border statistics, which also gather information about immigrants. Registry data can provide more detail and are less costly than official statistics, as the information is given directly by the individuals. For instance, data collected from residence permits include details about intention and length of stay. They also require specific details on place of origin and address in the country of stay. The same applies to labour permit data. Nevertheless, in Europe, where freedom of movement and work is established, it is difficult to know the flows and stocks of EU immigrants using these administrative data unless all the individuals are registered. An alternative is to use health insurance data. With these, it is possible to infer the stocks more accurately, provided the immigrants register for health insurance. In addition, registries can also collect information about asylum seekers (Footnote 7) and refugees (Footnote 8). However, this information is not always present in all migration data. In some countries, such as France, Italy and the UK, asylum seekers residing at least 12 months in a country are included in the data. In other countries, such as Belgium, Sweden and Finland, they are excluded [62]. Again, the application of different definitions makes it difficult to compare data across different countries. Caution should therefore be used when inferring the immigrants’ journey from administrative data, as it is difficult to identify the true movements of immigrants.

The use of traditional data in studying the journey of immigrants is definitely useful. These data can be used for building models of migration [144] and understanding the determinants of migration. But for the reasons discussed above, several drawbacks have to be taken into account. To improve data quality, institutions provide estimates to fill the gaps between years, or use the double-entry matrix (Footnote 9), first introduced by UNECE (Footnote 10), to establish comparability across different nations’ data (see, for instance, [50, 142, 143]). Nevertheless, despite these efforts, the data still appear inconsistent and unreliable. With the availability of social big data sources, researchers hope not only to overcome the limitations of traditional data, but also to be able to conduct real-time analyses at a higher accuracy level.
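The idea behind a double-entry comparison can be illustrated with a small sketch. For each origin-destination pair, the immigration figure reported by the receiving country is placed next to the emigration figure reported by the sending country, and the gap between the two exposes inconsistencies. The function name, the relative-gap measure and all numbers below are assumptions made for illustration; they are not taken from UNECE or from the cited works.

```python
# Hypothetical illustration of a double-entry migration table.
# All figures are invented for the example; they are not official statistics.

def double_entry(immigration_reports, emigration_reports):
    """Pair the two available reports of each (origin, destination) flow.

    immigration_reports: {(origin, destination): flow reported by the destination}
    emigration_reports:  {(origin, destination): flow reported by the origin}
    Returns {(origin, destination): (received, sent, relative_gap)}.
    """
    table = {}
    for pair in sorted(set(immigration_reports) | set(emigration_reports)):
        received = immigration_reports.get(pair)
        sent = emigration_reports.get(pair)
        if received is None or sent is None:
            gap = None  # one side did not report, so no comparison is possible
        else:
            largest = max(received, sent)
            gap = abs(received - sent) / largest if largest else 0.0
        table[pair] = (received, sent, gap)
    return table


immigration = {("A", "B"): 1200, ("B", "A"): 300}
emigration = {("A", "B"): 800, ("B", "A"): 300, ("A", "C"): 50}

matrix = double_entry(immigration, emigration)
for pair, (received, sent, gap) in matrix.items():
    print(pair, received, sent, gap)
```

A large relative gap for a pair signals that the two countries probably apply different definitions or have different coverage, which is the kind of discrepancy that harmonization efforts try to reduce.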

2.2 Alternative data sources: Is now-casting possible?

In recent studies, the use of social big data in the study of immigrants’ journey is increasing. A variety of data types can fall under this category: data from social media, internet services, mobile phones, supermarket transactions and more. These data sets contain detailed information about their users. Furthermore, they cover larger sets of the population than some of the traditional data sources, which are limited in terms of sample size. Yet, the literature points out that the data may be biased because of the characteristics of the users in the sample. For instance, it is known that the majority of Twitter users are young, so Twitter data cannot represent the whole population. Nevertheless, various studies state that the observed estimates of immigrants’ flows and stocks extracted from these unconventional data sources can still improve the understanding of migration patterns (see, for instance, [87, 126, 183]).

Big data allows researchers to study immigrants’ movements in real time. Twitter data, for instance, provide geolocated time-stamped messages. Geolocated messages are often the key variable in estimating the flows and stocks but not the only one. In the work of [183], the authors infer migration patterns from Twitter data by looking at where the tweets were posted. Other studies like [126] assume origins of immigrants from language used in tweets, whether the local language was used or not. These studies conclude that Twitter data allow researchers to localize the flows and stocks of immigrants and to observe recent trends even before the official statistics are published. The results of these studies are validated by matching the big data results to official data.
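As a rough illustration of how geolocated, time-stamped messages can reveal moves, the sketch below assigns each user the country from which they posted most often in each period and flags users whose assigned country changes between consecutive observed periods. This is a minimal sketch under simplifying assumptions, not the procedure of [183]: the tuple layout, the function names and the toy records are all hypothetical.

```python
from collections import Counter, defaultdict


def modal_country_by_period(tweets):
    """tweets: iterable of (user, period, country) tuples.

    Returns {(user, period): most frequent country}; ties are broken
    alphabetically so that the result is deterministic.
    """
    counts = defaultdict(Counter)
    for user, period, country in tweets:
        counts[(user, period)][country] += 1
    return {
        key: min(counter.items(), key=lambda kv: (-kv[1], kv[0]))[0]
        for key, counter in counts.items()
    }


def inferred_moves(tweets):
    """List (user, previous_period, current_period, from_country, to_country)."""
    by_user = defaultdict(dict)
    for (user, period), country in modal_country_by_period(tweets).items():
        by_user[user][period] = country
    moves = []
    for user in sorted(by_user):
        periods = sorted(by_user[user])
        for prev, curr in zip(periods, periods[1:]):
            origin, destination = by_user[user][prev], by_user[user][curr]
            if origin != destination:
                moves.append((user, prev, curr, origin, destination))
    return moves


# Toy records (invented): user "u1" posts mostly from IT, then from DE.
toy_tweets = [
    ("u1", 1, "IT"), ("u1", 1, "IT"), ("u1", 1, "FR"),
    ("u1", 2, "DE"), ("u1", 2, "DE"),
    ("u2", 1, "ES"), ("u2", 2, "ES"),
]
print(inferred_moves(toy_tweets))
```

In practice, a change of modal country can also reflect travel rather than migration, so a real analysis would need longer observation windows and validation against official data, as the studies above do.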

In one of our recent works, we have also analysed geolocalized Twitter data, with the aim of quantifying diversity in communities, by computing a superdiversity index [139] (see also Sect. 3). This index correlates very well with migration stocks; hence, we believe it can become an important feature in a now-casting model. A different line of work we are pursuing is that of estimating user nationality from Twitter data. As seen above, language can be important in understanding nationality; however, we believe that this can be refined by also employing the connections among users. The model can be validated with data collected through monitoring frameworks such as that presented in [21]. Once users are assigned a nationality, we can use these assignments for a now-casting model of migration stocks. Additionally, we can define communities on Twitter based on nationality and study the flow of ideas among communities, and the role of migrants in the spreading of information. Furthermore, these data could enable analysis of ego networks of migrants (Sect. 3.2.5).

Skype ego network data can also be used to explain international migration patterns [101]. In this case, the IP addresses that appear when users log in to their accounts can be used to infer the place of residence. More precisely, the authors look at how often users log in from a given location, which allows them to label that location as the users’ place of residence. The users’ place of residence can then be used to observe whether migration took place or not.

Big data can also be used to study movements of individuals in times of crisis. For instance, [30] propose to use mobile phone data to trace individuals’ movements after the earthquake in Haiti. With these data, the authors are able to trace users, as the phone towers provide information about their locations. They conclude that big data can be used to observe movements in real time, which cannot be done through traditional data.

Another limitation of traditional data sources is that it is difficult to anticipate immigrants’ movements. The authors of [36] study whether the GTI (Footnote 11) can now-cast the immigrants’ journey. However, as the authors point out, not every search means that the searcher intends to migrate. To address this issue, they compare Gallup World Poll data (Footnote 12) with the results obtained with GTI data. The Gallup World Poll is a survey conducted in more than 160 countries, and it contains questions on whether the individuals are planning to move to another country and, if so, whether the plan will take place within 12 months and, lastly, whether they have taken any action to do so, e.g. visa applications or searching for information. The comparison validates that the GTI data can indeed now-cast the “genuine migration intention”.

Unconventional big data have their limitations like traditional data. Nevertheless, new big data methods are developing in order to address the newly arising issues. In addition, big data cover worldwide users with very fine granularity of information on immigrants’ journey. The hope is that by merging knowledge from both traditional and novel data sets we will be able to overcome some of the issues and build accurate models for now-casting immigrant journeys and immigration rates.
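To make the notion of now-casting concrete, the following minimal sketch fits a one-variable least-squares line that maps a big-data proxy (for instance, a count derived from geolocated messages) to official migrant stocks for the years in which both are available, and then applies the fitted line to the latest proxy value, for which no official figure exists yet. The single-feature linear form, the variable names and the numbers are assumptions chosen for illustration; an actual now-casting model would likely combine several features and be validated out of sample.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept (one feature)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def nowcast(slope, intercept, latest_proxy):
    """Estimate the not-yet-published official value from the latest proxy."""
    return slope * latest_proxy + intercept


# Invented training data: proxy values and official stocks for past years.
proxy_history = [1.0, 2.0, 3.0, 4.0]
official_history = [110.0, 205.0, 310.0, 395.0]

slope, intercept = fit_line(proxy_history, official_history)
estimate = nowcast(slope, intercept, latest_proxy=5.0)
print(slope, intercept, estimate)
```

The design choice here is deliberately simple: official data act as the ground truth for calibration, while the proxy supplies the timeliness that official data lack.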

2.2.1 Scientific migration

Given its importance to scientific productivity and education, the study of scientific migration has attracted a growing interest in the last years, fostered by the availability of massive data describing the publications and the careers of scientists in several disciplines [47, 129, 138, 155]. Understanding the mechanisms driving scientists’ decision to relocate can help institutions and governments manage scientific mobility, implement policies to attract the best scientists or prevent their departure, hence improving the quality of research. At the same time, predictive models explaining when, and where, scientists migrate can facilitate the design of job recommender systems for scientists based on their profile [156], or help search committees seek successful candidates for their research jobs.

The studies proposed in the literature on scientific migration can be grouped into three main strands of research. A first group of studies focuses on country-level movements or on movements between universities [20, 125, 131]. Relying on a large-scale survey, Appelt et al. find that geographic distance, as well as socio-economic disparities and scientific proximity, negatively correlates with the mobility of scientists between two countries [14]. By investigating the professional and personal determinants of the decision to relocate to a new institution, Azoulay et al. [22] find that scientists are more likely to move when they are highly productive and their local collaborators are fewer and less accomplished than their distant collaborators, while they find it costly to disrupt the social networks of their children. Gargiulo and Carletti [68] investigate the movements of scientists between universities and find that starting from a lower-rank institution lowers the probability of reaching a top-rank institution and raises the probability of remaining in a low-rank one; conversely, starting from a high-rank university strongly lowers the probability of ending up in a low-rank one.

A second strand of research focuses on understanding the effect of a scientist’s relocation on their scientific impact. In this context, it has been discovered that while moves from elite to lower-rank institutions lead to a moderate decrease in scientific performance, moves to elite institutions do not necessarily result in a subsequent performance gain [52]. Sugimoto [161] analyses the migration traces of scientists extracted from Web of Science and reveals that, regardless of the nation of origin, scientists who relocate are more highly cited than their non-moving counterparts.

In the context of studying labour mobility, the availability of massive data sets of individuals’ career paths has fostered work on predicting individuals’ next jobs (outside academia) [156]. Paparrizos et al. [135] build a system to recommend new jobs to people who are seeking a job, using all their past job transitions as well as data on the employees. They train a predictive model to show that job transitions can be accurately predicted, significantly improving over a baseline that always predicts the most frequent institution in the data. More recently, Li et al. [111] propose a system to predict next career moves based on profile context matching and career path mining from a real-world LinkedIn data set. They show that their system can predict future career moves, revealing interesting insights into micro-level labour mobility.

Our recent work, conducted within the SoBigData project, sits at the intersection of the aforementioned strands of research. In particular, we investigate how a scientist’s scientific profile influences the decision to move, based on a massive data set consisting of all the publications in journals of the American Physical Society (APS) from 1950 to 2009: 360,000 publications, 3500 institutions and 60,000 scientists [98]. We approach the problem by constructing a two-stage predictive model. First, we predict, using data mining, which scientists will change institution in the next year. We describe a scientist’s profile as a multidimensional vector of variables describing three aspects: the recent scientific career, the quality of the scientific environment and the structure of the scientific collaboration network. From the constructed predictive model, we identify the main factors influencing scientific migration. Second, for those scientists who are predicted to move, we predict which institution they will choose using the performance-social-gravity model, an adaptation of the gravity model of human mobility to include the above-mentioned factors.
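For readers unfamiliar with gravity models, the sketch below shows the plain version on which such adaptations build: the attractiveness of a candidate destination grows with its size and decays with its distance from the origin, and the scores are normalized into choice probabilities. It does not reproduce the performance-social-gravity model, whose additional performance and social terms are not specified here. The function name, the exponent `beta`, the institution sizes and the distances are all hypothetical.

```python
def gravity_probabilities(origin, sizes, distance, beta=1.0):
    """Plain gravity-model destination choice.

    sizes:    {institution: size (e.g. number of scientists)}
    distance: {(origin, destination): distance, strictly positive}
    The score of destination j is sizes[j] / distance[(origin, j)] ** beta.
    The origin's own size is constant across destinations, so it cancels
    out in the normalization and is omitted.
    """
    scores = {
        name: size / (distance[(origin, name)] ** beta)
        for name, size in sizes.items()
        if name != origin
    }
    total = sum(scores.values())
    return {name: score / total for name, score in scores.items()}


# Invented example: three institutions and distances from "X".
sizes = {"X": 100, "Y": 200, "Z": 400}
distance = {("X", "Y"): 10.0, ("X", "Z"): 40.0}

probabilities = gravity_probabilities("X", sizes, distance, beta=1.0)
print(probabilities)
```

In this toy setting, the smaller but closer institution "Y" receives the higher probability, which illustrates the trade-off between size and distance that the gravity formulation encodes.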

A different recent line of work in the SoBigData project is to understand, by using ORCID data, what was the effect of the Brexit referendum on scientific migration in and out of the UK. Preliminary results (still unpublished) show an increase in UK researchers moving from the EU to the UK and an increase in EU researchers moving out of the UK.

2.3 Visual analytics of migration data

The phenomenon of migration is strongly associated with human movement. Analysis of movement data is one of the active topics in Visual Analytics research. The monograph [11] systematically considers a variety of possible representations of movement data. Frequently used representations are trajectories (sequences of time-stamped positions of individuals), time series (e.g. counts of departing, arriving or transit visitors over time), and events (e.g. movement with abnormal speed or unusual concentration of moving objects). A special case of trajectories is a trajectory consisting of only two time-stamped positions, the origin and destination of a trip. This representation is frequently used in migration studies, since more detailed information is often not available.

The following three main classes of techniques are applied for visualization of origin-destination (OD) flows: OD matrix [84], OD flow map [167] and a hybrid of a matrix and a map called OD map [182]. In an OD matrix, the rows and columns correspond to locations and the cells contain flow magnitudes represented by colour shades. The rows and columns can be automatically or interactively reordered for uncovering connectivity patterns. Disadvantages of the matrix display are the lack of spatial context and the limited number of different locations that can be represented. In OD flow maps, links between locations are represented by straight or curved linear symbols analogously to node-link diagrams. Various possible representations of directed links are discussed and evaluated by Holten et al. [92]. Flow magnitudes are shown by proportional line widths or by colour shades. OD maps [182] use a map-like grid layout with embedded maps that represent movement from/to selected locations to/from all other locations that correspond to remaining maps.
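The data structure underlying an OD matrix can be sketched in a few lines: each two-point trajectory contributes one count to the cell indexed by its origin (row) and destination (column). The function name and the toy trips are hypothetical, and locations are ordered alphabetically here only for simplicity; as noted above, rows and columns can be reordered to uncover connectivity patterns.

```python
def od_matrix(trips):
    """Build an origin-destination count matrix.

    trips: iterable of (origin, destination) pairs.
    Returns (locations, matrix), where matrix[i][j] is the number of
    trips from locations[i] to locations[j].
    """
    trips = list(trips)
    locations = sorted({place for trip in trips for place in trip})
    index = {place: i for i, place in enumerate(locations)}
    matrix = [[0] * len(locations) for _ in locations]
    for origin, destination in trips:
        matrix[index[origin]][index[destination]] += 1
    return locations, matrix


# Invented trips between three countries.
toy_trips = [("IT", "DE"), ("IT", "DE"), ("RO", "IT"), ("DE", "IT")]
locations, matrix = od_matrix(toy_trips)

print(locations)
for name, row in zip(locations, matrix):
    print(name, row)
```

Mapping each cell value to a colour shade yields the matrix display, while drawing a line of proportional width for each non-zero cell yields the corresponding flow map.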

A straightforward approach to showing time-variant flows is to use multiple displays (e.g. OD matrices, OD flow maps or OD maps) arranged either temporally in animation or spatially in a small multiple display. Map animation is not effective [170] because the user cannot memorize and mentally compare multiple spatial situations. In small multiples, a limited number of spatial situations can be shown simultaneously; hence, this approach is not suitable for long time series. Clustering of spatial situations [13] can be used to reduce the number of distinct situations that need to be shown. A completely different approach is to show the time series of flow magnitudes separately from maps, for instance, as it is done in FlowStrates [38]; however, the spatial situations and their changes over time cannot be seen.

The paper [12] defines a workflow for analysis of long-term origin–destination data. The approach starts with aggregation of flows by origin or destination regions, directions and distances of move, and time intervals. Next, time intervals are clustered according to feature vectors composed from descriptors of all origins representing magnitudes of flows in all considered directions and distances. The proposed system enables exploration and continuous refinement of clustering results. The process is supported by space- (flow maps, diagram maps) and time-based (calendar showing temporal dynamics of situations by colours of dates) visualizations.

The techniques described in this section have been successfully applied or are potentially applicable to analysis of long-term migration data, for detecting patterns and changes of migration.

3 The stay: effects on communities, immigrant integration

The study of the effects of migration on the communities involved includes various traditional lines of study. Immigrant integration is a complex process that can reflect a progressive adoption of the norms that prevail in the destination country or a return to the habits of the home countries. Integration has been analysed from multiple viewpoints. Here, we outline some of these lines of work, with some recent examples, and we provide a few directions for development using big data. However, this section is not intended to be a complete survey of methods, since the complexity of the issue would require much more than a few pages to describe. For more comprehensive reviews on migrant integration, please see, e.g. [41].

3.1 Current practices

In general, immigrant integration and cultural changes have been traditionally analysed using census data, administrative registries and surveys. In this section, we describe the different criteria used for analysis. We start with a discussion of research studying social integration (social network, mixed marriages), and then, we move on to labour market integration and language adoption of immigrants. We conclude the section with a discussion of the effects of immigration on educational attitudes, on economic prosperity and on political attitudes.

The effect of the social network on migration was analysed by [118] using survey data on Mexican migrants to the USA. The richness of the social networks is shown to promote migration of low-skill migrants, while for communities where the social networks are weak, high-skill migrants are present. In terms of migrant integration, social networks in schools were analysed in European countries by [158]. They show that homophilic attitudes develop differently for immigrants and natives, with the former being positively influenced by multi-ethnicity in class. Ego networks of Turks and Moroccans in the Netherlands are studied in [172], using survey data. The authors show that in general closest friends come from the same ethnic group. The effect is stronger for women and those that are culturally more dissimilar to the natives.

In terms of marriage relationships, in the USA, marriage with whites is analysed for different ethnicities and education levels [145, 146]. Divorce rates are shown to be higher for mixed than for non-mixed couples in the Netherlands, particularly for couples coming from very distant cultures [157]. The relation between mixed marriages and the immigration rate in Italian communities was studied by [4]. The authors show that there are differences between large cities and smaller municipalities, and they argue, based on probabilistic interaction models, that this is due to the structure of the social network, which is disconnected in large communities. The presence of female immigrants was found to increase the risk of separation of native couples in Italy, using survey data and official statistics [177].

Integration in the labour market has been analysed for various western and non-western countries by [159]. They show that general patterns of integration and the factors affecting it are very similar between western and non-western countries. Factors that affect the probability of finding a job are language exposure, cultural distance and the economic advancement of the origin country. Recent work shows that language training has an important effect on labour market integration of immigrants in France [112]. The effect of education on employment is analysed in [132] for Mexican immigrants in the USA. Integration in the labour market can also depend on the location where immigrants settle. In some cases, such as refugee situations, locations are assigned centrally. Recent work [24] has used data on past employment success to provide better matches between locations and refugees, showing that the probability of being employed can be increased by 40 to 70%.

Both mixed marriages and labour market integration were analysed using official data from Spain by [26]. They show using insights from statistical physics that while mixed marriages seem to be driven by peer interaction, this is missing when it comes to labour market integration. The same approach can be used to forecast integration from the two points of view [49].

Language adoption is a very important factor contributing to the success of an immigrant in the host country, since it provides opportunities for education, employment and social interaction. Integration in the USA was analysed by [6], looking at the language spoken at home by third-generation immigrants. The study shows that while Asians and Europeans adopted the language at a similar pace, Spanish-speaking families were still preserving some of their mother tongue. A different study [173] looks at the dynamics of language adoption in the USA and shows that education is an important factor positively influencing the speed of adoption, while group size has a negative influence. A related issue is that of naming children [1]. A recent study of early US census data shows that people coming from families where children were given foreign names were less successful in terms of education and earnings, and were more likely to marry foreign spouses. Bilingual settings were studied in [174], which looks at language adoption of immigrants in Belgium. The study shows that immigrants adopt the more international language faster.

The above-mentioned works study integration by looking mostly at the immigrant population. However, effects on the local population due to integration of migrants exist too. For instance, educational expectations of middle school children were shown to change in children both from native and immigrant communities, in Italy, based on survey data [122, 123]. Immigrant children increased their expectations in the presence of native children with high expectations. Native children studying in multiethnic classes seemed more prone to high expectations. The effect of school class composition and ethnic attitudes was analysed in [39], showing that a balanced composition is beneficial for all ethnic groups involved.

A different effect that can be studied is related to economic prosperity of the target society. Diversity of birthplace was shown to increase economic prosperity [7], especially in the case of high-skill migrants moving to rich countries. The cultural diversity of the origin country was also analysed, showing that there is an optimal cultural distance for immigrants to maximize the beneficial economic effects. At the same time, however, [25] show that competition in the labour market and public services, together with cultural differences, generates a shift in political inclination. For instance, a shift of votes towards the left-wing parties was observed in Italy. Similar changes were observed in Austria, where one factor was the concern about the quality of the neighbourhood [86].

Fig. 1
Association to Italian Supermarket Chain. Trends of the number of customers with fidelity card for Albania, France and Romania

3.2 Towards a novel integration index using alternative data sources

While the types of studies exemplified in the previous section have been instrumental in understanding the effects of migration, the fact that they are based on traditional data makes them inherit the disadvantages of these data. Big data can help to analyse the issues above, and others, with the advantage of producing real-time results and enabling analysis at higher spatial resolution. For instance, retail data can help understand how immigrants adopt the habits and values of the new community they live in. Mobile call data records (CDRs) can be used to describe social interaction and mobility patterns of immigrants, and to understand segregation. OSN data can help study various topics, such as social integration, language adoption, changes in the local language and sentiment towards immigrants. All these data types can also be combined to build a novel multi-level integration index that takes into account all of these criteria. In the following, we will exemplify some of these topics, including existing results from our project and new directions to pursue.

3.2.1 Retail data: tell me what you eat, I will tell you who you are?

The measures for immigrant integration discussed in Sect. 3.1 capture choices that can be easily observed and are potentially exposed to social sanctions. Moreover, they are usually measured at one point in time, while integration is a dynamic phenomenon. The analysis of retail data from a supermarket chain can enable us to understand whether immigrants are converging to or diverging from the norms and habits of the destination country. By observing immigrants’ food consumption baskets, we can estimate the degree of integration and how this varies in time. This behaviour is less prone to social sanction, since the food basket is not generally known to people outside a family. Furthermore, we can identify which factors are the most relevant for integration. The degree of integration can be considered both with respect to economic aspects and based on how immigrant customers change their habits during their stay in terms of purchased products.

Market basket analysis and the study of food consumption have been widely used in the literature for different purposes, such as defining individual indicators of customer predictability [79], studying GDP trends [80], analysing customers with respect to their temporal purchasing patterns [82] and classifying them as residents or tourists according to their shopping profile [81]. Exploiting retail data to study the migration phenomenon from an individual and collective point of view that is not exposed to social sanctions, and with multiple observations in time, can bring to light novel results useful for better understanding the migration phenomenon and also for developing well-being policies.

Our project owns a key data source for these analyses, composed of scanner data from a large Italian retail market chain, which have been available since January 2007 for more than 1.1 million customers holding a fidelity card. The data set includes the price, quantity, promotional sales (if any) and the name of the good purchased out of a set of around 600,000 products. Besides this information, the country of birth and the date on which the fidelity card was obtained are available for each customer. About 7% of the customers are foreign-born, whereas the immigration rate in Italy is currently around 8.5%. On average, a foreign customer is observed 5 times per month, with a mean monthly food expenditure of about EUR 150. In Fig. 1, we report the cumulative number of customers joining the fidelity club for Albania, Romania and France. We observe how the trend is stable for Albania, while the number of customers with a fidelity card is growing for Romania and decreasing for France. These indices are in line with the immigration trends from European official statistics, indicating that these data could be representative of the migrant population. In the following, we discuss research directions that our project is pursuing.

To understand whether there is a convergence in food consumption choices of immigrants (by country of birth), two orthogonal approaches can be followed. A top-down approach analyses variables aggregated over the various items purchased. For each foreign-born customer, these variables take into account the difference between the normalized amount spent in a specific period and the mean amount spent in that period by Italian customers. In this way, for each foreign-born customer we can obtain a time series indicating whether that customer is converging to or diverging from the Italian norms. Hence, we can find foreign countries whose customers have homogeneous behaviours, but also countries with different integration behaviours.
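As an illustration of the top-down approach, the following Python sketch computes, for one foreign-born customer, the per-period gap between their normalized spending and the Italian mean. The function names, the period keys and the toy values are hypothetical, and the trend check is a deliberately crude assumption; the aggregated variables and normalization actually used by the project are not specified here.

```python
from statistics import mean

def divergence_series(customer_spend, italian_spend):
    """Signed per-period gap between one customer and the Italian mean.

    customer_spend: {period: normalized amount} for one foreign-born customer.
    italian_spend:  {period: [normalized amounts of Italian customers]}.
    Only periods present (and non-empty) in both inputs are returned.
    """
    series = {}
    for period in sorted(customer_spend):
        reference = italian_spend.get(period)
        if reference:
            series[period] = customer_spend[period] - mean(reference)
    return series

def is_converging(series):
    """Crude trend check (an assumption): the last gap is smaller in
    absolute value than the first one."""
    gaps = [abs(series[p]) for p in sorted(series)]
    return len(gaps) >= 2 and gaps[-1] < gaps[0]

# Toy data: share of the monthly budget spent on one product category.
customer = {"2007-01": 0.10, "2007-02": 0.14, "2007-03": 0.19}
italians = {
    "2007-01": [0.20, 0.22],
    "2007-02": [0.20, 0.20],
    "2007-03": [0.21, 0.19],
}
gaps = divergence_series(customer, italians)
print(gaps, is_converging(gaps))
```

A more careful analysis would fit a trend to the whole series rather than compare its endpoints, and would aggregate the resulting series by country of birth.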

A weakness of the top-down approach is that it is not easy to understand which products lead to the convergence/divergence. A bottom-up approach analysing the basket composition can provide this kind of information. In particular, our idea is to extract, for each customer and for different periods, their individual representative baskets using the algorithm defined in [83]. The representative baskets of the customers are then re-clustered for each country to develop national collective representative baskets. Through a set-based distance measure, this allows us to develop an indicator of shopping divergence/convergence with respect to the typical Italian baskets.
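The last step of the bottom-up approach can be sketched as follows, assuming that the national representative baskets have already been extracted (the algorithm of [83] is not reproduced here) and are available as sets of product names. The Jaccard distance is used as one possible set-based measure; the text does not state which measure the project adopts, and all names and data below are hypothetical.

```python
def jaccard_distance(basket_a, basket_b):
    """Jaccard distance between two baskets given as sets of product names."""
    a, b = set(basket_a), set(basket_b)
    if not a and not b:
        return 0.0
    return 1.0 - len(a & b) / len(a | b)

def divergence_over_time(national_baskets, italian_baskets):
    """Distance to the Italian basket for each period present in both inputs."""
    return {
        period: jaccard_distance(national_baskets[period], italian_baskets[period])
        for period in sorted(national_baskets)
        if period in italian_baskets
    }

# Toy representative baskets for two periods.
italian = {
    2008: {"pasta", "tomato", "olive oil", "coffee"},
    2012: {"pasta", "tomato", "olive oil", "coffee"},
}
national = {
    2008: {"rice", "tea", "tomato"},
    2012: {"pasta", "tea", "tomato", "olive oil"},
}
distances = divergence_over_time(national, italian)
print(distances)
```

A decreasing distance over time would indicate convergence towards the Italian basket, and the products entering or leaving the intersection show which items drive it.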

Finally, we underline that 14 per cent of the foreign-born customers disappear from the data set after some activity. The purchases of these customers could also be used for studying the return to the origin country.

3.2.2 Call data records

A large amount of work has been done using call data records (CDRs) to understand individual [70, 75, 137, 181] as well as group mobility [89, 114, 136, 168]. These works range from empirical analyses of large CDR data sets [70, 75, 136, 137, 181] to proposals of theoretical mobility models [154]. Initiatives to motivate researchers to analyse CDR data have also appeared, through data challenges such as the Data for Development (D4D) challenge in Senegal [35] or the recent Data for Refugees (D4R) challenge in Turkey [152, 169]. Readers can refer to [34] for a survey of works related to using CDR data for individual mobility studies and models.

A recent example is the study of the flocking and mobility behaviour of the population after the Haiti earthquake using CDR data [113]. Researchers found that the mobility patterns of the population after natural calamities are predictable: people tend to move to destinations where they had been making more calls before the disaster. In another study of a natural calamity, the Christchurch earthquake of February 2011 in New Zealand [2], the researchers found that people moved either to big cities like Auckland or to small towns. However, no correlation between the mobile phone calls before and after the disaster has been reported. In all cases, this is an important outcome, as it can help in making timely and effective infrastructural decisions in times of emergency or natural disaster [51].

In a different dimension, mobility patterns have also been studied with respect to socio-economic development [136]. The authors found a strong correlation between human mobility patterns and socio-economic indicators. It has also been shown that mobility patterns can be used for creating detailed maps of population distribution which are more accurate and recent. This approach is particularly useful for poor countries. This in turn can help in creating proper socio-economic policies for the population [51].

However, while mobility analyses are abundant, not much work has been done to analyse the international migration phenomenon using CDRs. This is due to several reasons. First, CDR data sets typically span only one nation. Secondly, in general, due to privacy reasons, no information on the nationality of the customer is provided. Without these pieces of information, studying migration with these data is difficult. One exception is the above-mentioned D4R challenge, where the refugee status of customers is made available. Our project has participated in this challenge, together with several other teams, concentrating on five different aspects: health, integration, unemployment, safety and security, and education. For details on the results obtained by other teams, please see the published collection of articles [151]. Our objective was to analyse integration and combine the Turktelekom data with other data sets [31]. We observed that integration seemed to increase over time for refugees and also that the presence of refugees influenced the housing market in Turkey, decreasing housing prices.

Another recent example where CDR data were used to analyse transnational mobility is [5], using CDR data that includes mobile roaming events. Transnational population mobility can be defined as living and working in two or more countries. Understanding this phenomenon with traditional statistics and register-based data is impossible. The authors show that roaming data can enable the analysis of travel behaviour and social profile of visitors. They can differentiate between tourists, cross-border commuters, foreign workers and transnationals.

3.2.3 Language in online social networks

Language allows us to express needs and feelings and to achieve our communication goals. Society changes and grows more complex over time; thus, language must evolve and adapt itself to the new needs of its population. As a consequence, this evolution leads to changes in, and the creation and vanishing of, expressions, dialects and even whole languages [74]. Over the past two decades, globalization has driven changes in the social, cultural and linguistic panorama of societies all over the world. The earlier multiculturalism, intended as the ethnic minorities paradigm, has turned since the 1990s into what Vertovec [176] calls Superdiversity. The concept aims to capture the increasingly complex and less predictable set of relationships between ethnicity, citizenship, residence, origin and language. Thanks to the influence of pioneering works of linguistic anthropologists, mixing, mobility patterns and historical framework became key issues in the study of languages and language groups [33]. Over time, linguists and sociologists analysed variation and changes in both oral [105] and written [29] languages by exploiting surveys, corpora and records [74]. In the last decade, the pervasive use of online social networking and micro-blogging services has made freely produced content available in volumes never seen before. This unprecedented wealth of written data allows us to recover a detailed picture of language evolution from both the geographical and the temporal points of view [130].

The literature regarding language in social networks applied to migration studies is wide and involves several research fields, including but not limited to mobility patterns, migration stocks and flows, well-being and sentiment analysis. Even though some works focus more on metadata than on the actual content, the text bears a wealth of information, starting from the language in which it is written [107]. For instance, Kulkarni et al. [104] have proposed a novel method to detect English linguistic variation and quantify its significance among geographic regions; Ibrahim et al. [94] have combined different data to present a sentiment analysis system for standard Arabic and Egyptian dialectal Arabic; language has also been investigated in terms of the spatial distribution and spatial extension of dialects. In [116], geolocated tweets are exploited to identify localized patterns in language usage and to analyse language diversity over different countries; Mocanu et al. [124] have characterized the worldwide linguistic geography by aggregating multi-scale OSN data; Jurdak et al. [99] have compared Twitter mobility patterns with patterns observed through other technologies, e.g. CDRs, by using individuals’ spatial orbit as the measure of how far they move; Gonçalves et al. [74] have found two global super-dialects in modern-day Spanish; and Doyle [54] has proposed a Bayesian method to build conditional probability distributions of the spatial extension of English dialects.

Fig. 2
Superdiversity index (left) and immigration levels (right) across UK regions at NUTS2 level [139]

Within the SoBigData project, we have analysed the concept of Superdiversity theorized by Vertovec (2007) and proposed a measure to quantify it [139]. We focus on the joint analysis of both language and geographic dimensions starting from a Twitter data set. Our ground hypothesis is built on the idea that different cultures use language in different ways and, as a consequence, the emotional value associated with words changes depending on the culture of the person who writes a tweet. We introduced a Superdiversity Index (SI), which is based on the diversity of the emotional content expressed in texts of different communities. Specifically, we extract the emotional valences of words used by a community from Twitter data produced by that community. We compare the obtained valences with a standard dictionary tagged with sentiment. The distance between the community and the standard valences is a measure of superdiversity for the community. This SI measure is computed at different geographical scales based on the Classification of Territorial Units for Statistics (NUTS) for two different nations: Italy and UK, and validated with data from the above-mentioned D4I challenge (Sect. 2). We observe a very high correlation with immigration rates at all geographical levels. Figure 2 shows the case of the UK, where we observe that the geographical distribution of the proposed SI matches very well that of official immigration rates. Thus, we believe that, besides quantifying the cultural changes that migrants instil in the community, our SI can also become a key measure in a now-casting model for migration stocks.
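The comparison step behind the SI can be illustrated with a minimal Python sketch. It assumes that the community valences have already been estimated from tweets and uses the mean absolute gap over shared words as the distance; both the function name and this choice of distance are assumptions made for illustration, and the procedure in [139] may differ.

```python
def superdiversity_sketch(community_valence, standard_valence):
    """Mean absolute gap between community and dictionary valences.

    Both arguments map words to a valence score; only words present in
    both dictionaries are compared. The distance used here is an
    illustrative assumption, not the measure defined in [139].
    """
    shared = community_valence.keys() & standard_valence.keys()
    if not shared:
        raise ValueError("no words in common with the standard dictionary")
    total = sum(abs(community_valence[w] - standard_valence[w]) for w in shared)
    return total / len(shared)

# Toy valences on a [-1, 1] scale.
standard = {"home": 0.6, "work": 0.1, "rain": -0.3}
region_a = {"home": 0.5, "work": 0.1, "rain": -0.2}  # close to the dictionary
region_b = {"home": 0.0, "work": 0.5, "rain": 0.3}   # far from the dictionary

si_a = superdiversity_sketch(region_a, standard)
si_b = superdiversity_sketch(region_b, standard)
print(si_a, si_b)
```

Under this sketch, a region whose word valences deviate more from the standard dictionary receives a higher score, which is the quantity that would then be compared with official immigration rates.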

3.2.4 Migration and sentiment

One way of studying migrant integration is by analysing the opinions of the locals related to migration topics and different migrant groups. While performing targeted surveys is one way of collecting such opinions, using online social networks (OSNs) is a novel direction that can overcome some limitations of survey data. Using Twitter for opinion mining and to study sentiment and user polarization is a vast subject [134]. The existence of polarization in social media was first studied by Adamic et al. [3], who identified a clear separation in the hyperlink structure of political blogs. Conover et al. [48] later studied the same phenomenon on Twitter, evaluating polarization based on retweets. Most of the studies on polarization are still based on sentiment analysis of the content. The sentiment analysis methods proposed are numerous, and they are mainly based on dictionaries and on learning techniques through unsupervised [133] and supervised methods (lexicon-based method [163]) and combinations [103]. Opinion mining techniques are widely used, particularly in the political context [3] and on Twitter [45]. Recently, new approaches based on polarization, controversy and topic tracking in time have been proposed [46, 69]. The idea of these approaches is to divide the users of a social network into groups based on their opinion on a particular topic and to track their behaviour over time. These approaches are based on network measures and clustering [69] or hashtag classification through probabilistic models [46], with no use of dictionary-based techniques.

Regarding the migration topic, in Coletto et al. [44] we propose an analytical framework aimed at investigating different views of the discussions regarding polarized topics which occur in OSNs. The framework supports the analysis along multiple dimensions, i.e. time, space and sentiment of the opposite views about a controversial topic emerging in an OSN, and is applied to the perception of the refugee crisis in Europe and Brexit. The sentiment analysis method adopted is efficient in tracking polarization over Twitter compared to other methods. Unlike other approaches for studying social phenomena, we do not base our analyses on the change of location of Twitter users to measure the flow of individuals through space, but rather we aim at understanding the impact of migrants’ movements on EU citizens’ perception and on their resulting decision to vote for Brexit.

Fig. 3
Sentiment related to the refugee crisis across European countries (from [44]): red (dark grey in print) corresponds to a higher predominance of positive sentiment, yellow (light grey in print) indicates lower positive sentiment. a The whole data set. b Limited to users mentioning locations in their own country. c Limited to all other users (colour figure online)

The framework, initially presented in [43], makes it possible to monitor the raw stream of relevant tweets in a scalable way and to automatically enrich them with location information (user and mentioned locations) and sentiment polarity (positive vs. negative). The analyses we conducted show how the framework captures the differences in positive and negative user sentiment over time and space. The resulting knowledge supports the understanding of complex dynamics by identifying variations in the perception of specific events and locations.

We used the Twitter Streaming API under the Gardenhose agreement (granting access to 10% of all tweets) to collect the English tweets posted in two periods: from mid-August to mid-September 2015 for the refugees data set, and from mid-June to the beginning of July 2016 for the Brexit data set. We filtered out the tweets not related to the specific events analysed. The first data set refers to the refugee crisis and contains about 1.2 M tweets, while the second one refers to the Brexit referendum and contains about 4.3 M tweets. The data sets are available for use through Transnational Access in the SoBigData project infrastructure.

In our study, we try to answer the following analytical questions: What is the evolution of the discussions about refugee migration on Twitter? What is the sentiment of users across Europe in relation to the refugee crisis? What is the evolution of the perception in the countries affected by the phenomenon? Are users more polarized in the countries that are most impacted by the migration flow? Is the polarization of the users about refugees and the Brexit referendum somehow correlated? For this purpose, we analyse the ratio between pro- and against-refugee users across Europe. For example, Fig. 3 shows the geographical distribution of this ratio considering all users residing in a country, but also internal and external perception (perception of the users residing inside/outside a country C related to the refugees in C). We observe that Eastern countries in general are less positive than Western countries. Also, we note that for internal perception Russia, France and Turkey show very low sentiment. We conjecture that the sentiment of a person, when the problem directly involves his/her own country, could be more negative since we are generally more critical when issues are closer to ourselves. External perception is generally higher in countries most affected by the refugee crisis, such as France, Russia and Turkey, with the exception of Germany, where the decision to open borders seems to have produced positive internal sentiment.
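The ratio between pro- and against-refugee users, split into internal and external perception, can be sketched in Python as follows. The record layout (one record per user and mentioned country, with a stance already assigned by the sentiment step) and all names are assumptions made for illustration; the framework of [43, 44] is not reproduced.

```python
from collections import defaultdict

def stance_ratios(records):
    """Ratio of pro to against users per mentioned country and perception type.

    records: iterable of (user_country, mentioned_country, stance), with
    stance in {"pro", "against"}. A record counts as "internal" when the
    user resides in the mentioned country and as "external" otherwise.
    """
    counts = defaultdict(lambda: {"pro": 0, "against": 0})
    for user_country, mentioned_country, stance in records:
        kind = "internal" if user_country == mentioned_country else "external"
        counts[(mentioned_country, kind)][stance] += 1
    ratios = {}
    for key, c in counts.items():
        # The ratio is undefined when nobody is against; report None.
        ratios[key] = c["pro"] / c["against"] if c["against"] else None
    return ratios

# Toy records: (country of residence, country mentioned, stance).
records = [
    ("DE", "DE", "pro"), ("DE", "DE", "pro"), ("DE", "DE", "against"),
    ("FR", "DE", "pro"), ("FR", "DE", "against"),
    ("FR", "FR", "against"),
]
ratios = stance_ratios(records)
print(ratios)
```

Mapping these ratios per country yields the kind of choropleth shown in Fig. 3, with one map for all users and one for each perception type.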

3.2.5 Ego networks and their effect on migration

Personal networks of migrants have been shown to play a strategic role in the destination country chosen by the migrant, in the well-being of the migrant (once settled in), and in the professional outcome [10, 65, 72, 175, 180]. For this reason, studying the properties of migrants’ personal network is a particularly promising avenue of research in digital demography, in order to characterize both the journey and the stay. In this section, we review the basic concepts of ego networks and some existing applications, and we argue that studying ego networks from OSN platforms can be a powerful tool in the analysis of migration.

It is a well-established result from sociology that personal networks, i.e. the ensemble of social relationships that an individual entertains with other people, have a significant influence on the quality of life of the individual in terms of, for example, job opportunities [76, 77], social support [100], power and influence in organizations/communities [108, 121, 128, 153]. Personal networks are also closely related to the concept of social capital, i.e. the network of connections, loyalties and mutual obligations [72] that translates into favours and preferential treatment. In this perspective, studying the evolution of personal networks over time is the ideal approach to characterize the modification of migrants’ social structures (or lack thereof), due to the migration process. This is related to one of the main subjects of study in this area, i.e. the characterization of integration of migrants. Integration is typically measured in terms of assimilation and transnationalism. Assimilation is defined as the gradual adoption of customs and traditions from the receiving country by the migrant and can be full [8, 9], partial [71] or segmented [141]. As a consequence of assimilation, the composition of a migrant’s personal network is expected to change significantly over time. At the opposite end from assimilation, there is the phenomenon of transnationalism, whereby migrants continue to participate in the political, economic and cultural life of origin societies and of fellow migrants from the same country [140]. Many researchers have postulated that the widespread availability of Internet connectivity and OSNs has made it easier to keep these transnational links with the origin country alive [109]. Again, this should be reflected in the personal network of migrants, in terms of number and relationship strength of links towards migrants and non-migrants from the same origin area. These changes can be studied using traditional data coming from targeted surveys, but also from OSN data that can fill some of the gaps present in survey data.

While most migration studies of personal networks are qualitative, quantitative studies are available in the literature on generic social networks. Quantitative studies often explore the graph-theoretical concept of ego networks. An ego network is the graph-based abstraction that models the personal network of an individual (called ego). Besides the ego, the nodes in the ego network correspond to the people the ego entertains social relationships with. These people are referred to as alters. The ego and each alter are connected by an edge, whose weight corresponds to the strength of their social relation (often referred to as emotional closeness). Depending on the ego network model used, ties between alters can also be included [64]. More rarely, only the alter–alter ties are considered for extracting ego network properties [117]. Several structural properties of ego networks can be derived [85].
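The ego network abstraction described above maps naturally onto a small data structure. The Python sketch below is one possible representation, with weighted ego-alter edges and optional unweighted alter-alter ties; the class and method names are hypothetical and not taken from the cited works.

```python
from dataclasses import dataclass, field

@dataclass
class EgoNetwork:
    ego: str
    # alter -> tie strength (a proxy for emotional closeness)
    tie_strength: dict = field(default_factory=dict)
    # unweighted alter-alter ties, stored as frozensets of two alters
    alter_ties: set = field(default_factory=set)

    def add_alter(self, alter, strength):
        self.tie_strength[alter] = strength

    def add_alter_tie(self, a, b):
        if a == b or a not in self.tie_strength or b not in self.tie_strength:
            raise ValueError("both endpoints must be distinct, known alters")
        self.alter_ties.add(frozenset((a, b)))

    def strongest(self, k):
        """The k alters with the highest tie strength, strongest first."""
        ranked = sorted(self.tie_strength, key=self.tie_strength.get, reverse=True)
        return ranked[:k]

    def alter_density(self):
        """Fraction of possible alter-alter ties that are present."""
        n = len(self.tie_strength)
        possible = n * (n - 1) // 2
        return len(self.alter_ties) / possible if possible else 0.0

net = EgoNetwork(ego="ego")
net.add_alter("a", 0.9)
net.add_alter("b", 0.5)
net.add_alter("c", 0.1)
net.add_alter_tie("a", "b")
print(net.strongest(2), net.alter_density())
```

In OSN data, the tie strength would typically be estimated from interaction counts (for example replies or mentions), which is itself a modelling choice.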

Ego network models have been used in the literature to characterize human cognitive constraints and their impact on social processes. In particular, evolutionary anthropology has studied the structure of ego networks (as a representation of human personal networks) in terms of the cognitive investment required from the ego to actively maintain it. Dunbar [55] has found that human neocortex size places an upper limit on the number of meaningful relationships that can be maintained. Specifically, the group size predicted by the human neocortex size is around 150 alters, and it has been validated by studying tribal, traditional and modern societies [58, 90]. This limit on the size of the ego network, determined by the cognitive effort required to maintain active social relationships, is known as the social brain hypothesis [57]. Additional investigations of this cognitive constraint have shown that the alters in ego networks are organized into concentric circles around the ego, where the emotional closeness decreases and the number of alters increases as we move from the ego outwards [90, 186]. When looking at the size of the circles, a typical scaling ratio of around 3 between the sizes of consecutive circles has been observed [186], with the sizes of individual circles concentrating around the values of 5, 15, 50 and 150, respectively.
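The scaling ratio can be checked directly on the reference sizes quoted above, and the same sizes can be used to place alters into circles. The Python sketch below does both; assigning circles by rank with fixed cut-offs is a simplification assumed here for illustration, since empirical studies usually derive the circles from the data (for example by clustering contact frequencies), and the function names are hypothetical.

```python
def scaling_ratios(sizes):
    """Ratio between the sizes of consecutive (cumulative) circles."""
    return [outer / inner for inner, outer in zip(sizes, sizes[1:])]

def assign_circles(tie_strength, sizes=(5, 15, 50, 150)):
    """Index of the innermost circle containing each alter.

    Alters are ranked by decreasing tie strength; an alter at 0-based
    rank r falls in the first circle whose size exceeds r. Alters ranked
    beyond the outermost circle get None.
    """
    ranked = sorted(tie_strength, key=tie_strength.get, reverse=True)
    circles = {}
    for rank, alter in enumerate(ranked):
        circles[alter] = next(
            (i for i, size in enumerate(sizes) if rank < size), None
        )
    return circles

ratios = scaling_ratios((5, 15, 50, 150))
print(ratios)  # each ratio is close to 3

# Toy ego network with 20 alters of strictly decreasing tie strength.
alters = {f"alter{i}": 1.0 / (i + 1) for i in range(20)}
circles = assign_circles(alters)
print(circles["alter0"], circles["alter5"], circles["alter15"])
```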

Quite interestingly, ego networks formed through many interaction means, including face-to-face contacts [57], letters [90, 186], phone calls [115], co-authorships [17] and, remarkably, also OSNs, are well aligned with the above model. Specifically, very similar properties have also been found in Facebook and Twitter ego networks [19, 56]. In this sense, OSNs become one of the outlets that take up the brain capacity of humans, and thus are subject to the same limitations that have been measured for more traditional social interactions, and are not capable of “breaking” the limits imposed by cognitive constraints on our social capacity [59]. Tie strengths and how they determine ego network structures have been the subject of several additional works. For example, in [73] the authors provide some of the first evidence of the existence of an ego network size comparable to Dunbar’s number in Twitter. The relationship between ego network structures and the role of users in Twitter was analysed in [147]. In general, ego network structures are also known to have a significant impact on the way information spreads in OSNs, and on the diversity of information that can be acquired by users [15]. More generally, many traits of human social behaviour (resource sharing, collaboration, diffusion of information) are chiefly determined by the structural properties of ego networks [162]. Less studied (typically due to the lack of data) but equally important are the dynamic properties of ego networks, which characterize the evolution of personal networks over time. Arnaboldi et al. [16, 18] found that, unexpectedly, the strongest social relations in Twitter change frequently for the majority of generic users and also for the special class of politicians. This is a marked difference with respect to offline networks, where high-frequency relationships correspond to stable and intimate ties [90].

While data from OSNs have been recently used for migration studies, as detailed in previous sections, the graph-theoretical perspective has been rarely taken into account. The only exceptions are [88, 91], and [107]. In [88], community-centric metrics are used to study cultural assimilation as a function of the number of social ties between migrant communities and local people using the set of friendship links extracted from Facebook. The graph in this case is unweighted, i.e. the effect of different emotional closeness between node pairs is not taken into account. Lamanna et al. [107] again focus on cultural assimilation but from the spatial segregation standpoint. In this case, they use a bipartite graph structure, connecting tweet languages and cities. In [91], Facebook is used to study the network of teenagers in the Netherlands, concentrating on ethnicity and gender. The analysis shows that ethnicity plays a stronger role in link formation. However, the extended Facebook networks are less segregated, in general, compared to core ego networks.

To the best of our knowledge, ego networks of migrants built from OSN data have never been investigated in the related literature. This is quite surprising, as it is well known that many facets of human behaviour chiefly depend on the ego network structure. This includes features intrinsically related to migration and integration, such as willingness to cooperate with alters, resilience to problems and the possibility of seeking assistance from trusted alters [160]. As discussed before, migrants’ ego networks have been studied previously in the sociology literature, but only traditional data sources have been considered, and the approach to the analysis is typically more qualitative than quantitative. Here we advocate, along the lines of digital demography, that it is crucial to integrate traditional and innovative data sources to provide a timely and deeper understanding of personal networks and their impact on the migratory phenomena. For non-migrant users, the integration of OSN data has already proven successful and has highlighted properties that would have been impossible to extract from offline data alone [56]. Given the role played by personal networks in migration flows and integration, we believe it is crucial to fill this gap. OSNs are particularly appealing for accomplishing this task. In fact, they make it possible to reach scales far beyond what can be obtained from traditional data sources, and they also allow researchers to easily analyse temporal variations in the ego networks, ultimately enabling forms of now-casting of the migration phenomena.

Two research questions are particularly pressing: understanding and quantifying the relationship between the migrant’s online ego network and their migration choices, as well as measuring cultural assimilation and transnationalism through the evolution of online ego networks over time. With respect to the first question, it would be important to study the influence that alters in the different layers of migrants’ ego networks exert on the ego’s migration choices, distinguishing between the role played by weak and strong ties. These results can then be used to attempt predictions of the future migration choices of people, similarly to what is discussed in [98] for scientists. With respect to the second question, online ego networks can be a strategic asset for studying cultural assimilation, as they are typically easy to monitor for a prolonged amount of time, going beyond the single snapshot problem mentioned in [150]. As the migrant “moves” into the receiving society, we expect to observe a turnover in the ego network layers, reflecting the changes in his/her social relationships. This turnover can be measured in terms of similarity between layers across different temporal snapshots and observing the jumps that alters perform in the ego’s network (similar to what [18, 37] do for the ego networks of politicians and journalists on Twitter). Special attention should be reserved to the movements, inside the ego network, of co-nationals vs natives of the receiving country. Cultural assimilation predicts that the first class of ties should weaken progressively, while the latter should thrive. As a result, we expect to observe outward movements for co-nationals and inwards movements for natives inside the ego network. If this is not the case, we can postulate poor or imperfect assimilation and/or strong transnational ties linking migrants to their origin country.
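The turnover measures discussed for the second question can be sketched in Python as follows, assuming that each temporal snapshot maps alters to the index of their circle (0 being the innermost) and that every alter is labelled as a co-national or a native. The Jaccard coefficient is used as one possible similarity between layers; all names and the toy data are hypothetical.

```python
def layer_similarity(layer_t0, layer_t1):
    """Jaccard similarity between the members of a layer at two snapshots."""
    a, b = set(layer_t0), set(layer_t1)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def jumps(circles_t0, circles_t1):
    """Circle change per alter present in both snapshots.

    Positive values are outward movements (weaker ties), negative values
    are inward movements (stronger ties).
    """
    return {
        alter: circles_t1[alter] - circles_t0[alter]
        for alter in circles_t0
        if alter in circles_t1
    }

def mean_jump_by_group(jump_map, group_of):
    """Average jump for each group label (e.g. co-national vs native)."""
    totals = {}
    for alter, jump in jump_map.items():
        totals.setdefault(group_of[alter], []).append(jump)
    return {group: sum(v) / len(v) for group, v in totals.items()}

# Toy snapshots: alter -> circle index before and after.
snapshot_t0 = {"ana": 0, "bo": 1, "cy": 2}
snapshot_t1 = {"ana": 1, "bo": 0, "cy": 2}
group_of = {"ana": "co-national", "bo": "native", "cy": "native"}

moves = jumps(snapshot_t0, snapshot_t1)
averages = mean_jump_by_group(moves, group_of)
print(moves, averages)
print(layer_similarity({"ana", "x"}, {"bo", "x"}))
```

In this toy case the co-national moves outwards and the natives move inwards on average, which is the pattern that cultural assimilation would predict.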

4 The return: migrants returning to the country of origin

Migration is commonly seen as a permanent change in residence habits. However, when it is considered as a temporary phenomenon, several implications arise. Return migration is increasing in several countries, e.g. Mexico [40], China [185], Jamaica [166], Tunisia [119, 120] and Mali [42], with several effects observed. The most recent literature almost unanimously underlines the benefits brought by returning migrants. These benefits span a wide range of fields and include a rise in business activity and wages [178, 179], improved educational attainment and health conditions, increased electoral participation [42] and decreased violence [40].

The origin country can benefit economically from temporary migration in at least two ways [119, 120]. First, taking the example of Tunisia, the authors show that money transfers from abroad to migrants’ families are a sizeable source of income. Second, newly learned skills and savings can enable return migrants to start their own business in the origin country. The SoBigData project has also performed research in this field, with an approach based on data journalism that resulted in a documentary on return migration in Senegal: “Demal Te Niew” [23]. Zhao [185] has analysed the determinants of return migration and the economic behaviour of return migrants in China. The findings are partially in mild contrast with those already discussed. The author found that out-migration is still dominant, while return migration, driven by both push and pull factors, is limited in scale. Regarding employment, the results show that return migrants invest more in productive farm activity, but they do not show a higher tendency to engage in local non-farm activities than natives and migrants. Like most of the literature, Zhao’s findings testify to the key role of return migrants in the modernization process of developing or less rich countries.

A lot of research has focused on the “brain gain” provided by the return of high-skilled individuals, e.g. scientists returning to their country of birth. Scholars found that even if migration leads to a brain drain in the short term, return migration can contribute to brain gain [53, 179]. Moreover, the most recent research shows that return migrants contribute to their own community’s long-term well-being independently of the skills they have gained abroad [40].

Regarding the health field, Levitt et al. [110] have investigated dynamics between social practices gained abroad and health care. They show that social practices introduced by return migrants positively affect health care. These results seem related to the better social conditions of households with links to migrants and return migrants [60]. A different aspect relates to family-related decisions of return migrants. A recent study shows that Egyptian males returning from other Arab countries have more children than average [32], which could be due to the effect of the foreign culture on the decisions of the migrant.

The impact of return migrants on the governance of their origin country has been examined in [28, 42]. Results show that local policies are positively affected by returning migrants, since they contribute to increased political participation and enhanced political accountability. The political orientation of the home community can also be affected by the migration phenomenon. For instance, for Moldova, a recent study [27] shows how West-bound migration slowly changed voting behaviour, leading to the fall of the communist government in 2009.

Concerning education, research results agree that return migrants can be associated with increased and improved educational attainment. Taking the example of Mexico, Montoya et al. [127] found an increase of 26% in school attendance in households linked to at least one return migrant. This could mean that return migrants give higher priority to education.

Although the study of return migration is a long-standing area, most, if not all, analyses are based on traditional data. There is, however, great potential in employing novel data types such as mobile or OSN data to study return migration, and this remains an open research area.

5 Discussion and conclusions

We have discussed three lines of research where social big data can complement existing approaches to provide methods for analysing migration at small-area and high temporal resolution. In terms of estimating flows and stocks, some research already exists that tries to use social big data to now-cast immigration. However, models still need to be refined and validated. An important issue here is that a proper gold standard does not exist: exact current immigration rates are unknown, and those in the past can be noisy, so validation of now-casting models is not straightforward. Finding the relations between policies and immigration could be a step forward in finding means to validate model output. Another big data type that has not been included here, and that can help make predictions of climate-related migration, is satellite data. To measure migrant integration, we believe that several new data types can be used to introduce novel integration indices, based on retail consumer behaviour, mobile data, OSN language, sentiment and network analysis. Research in this direction is slightly less developed, mostly due to the low availability of ready-to-use data sets. Our consortium is taking steps in this direction, using existing data sets, participating in data challenges or collecting new data. For the return of migrants, research is again limited, although potential exists in data such as retail, mobile or OSN data.

In all three dimensions, research has to carefully consider the issues with the data being used. It is important that each study includes a well-planned data collection phase where available data are analysed to identify gaps and to devise strategies to fill them by integrating other types of data, in order to ensure that the problem being studied is thoroughly covered by the data used. In this process, research infrastructures such as SoBigData can be of great help. On the one hand, they can provide means to catalogue data, so that new data sets are available to the community for integration. On the other hand, they enable the community to share methods and experiences, so that the gaps identified and the solutions adopted to fill them can be reused. This applies not only to traditional data sources, but also to social big data. The complexity of digital demography implies that there is no free lunch with digital traces either [106]. One problem relates to the representativeness of the collected samples. For example, Facebook and Twitter penetration rates differ across the world and also vary with the age of users [184]. Being unable to track specific categories of users can steer policies on migration in a direction that unintentionally perpetuates discrimination or neglects the needs of the invisible groups. For the above reasons, the analytical and technical challenges of extracting meaning from this kind of data, in synergy with more traditional data sources, remain an open and very important research area, with some recent efforts made in this direction [93]. Model validation using existing statistics and the relation to migration policies is important. Furthermore, careful data integration could help in overcoming some of the selection bias, resulting in novel, multi-level indices based on big data.

A different issue relates to the ethical dimension of processing personal data, including sensitive personal data, that describe human individuals and their activities. As also stated in [187], the first rule that a researcher must follow is to acknowledge that data are people and can do harm. The context of migration is particularly sensitive to this problem, since the individuals described in the data are often especially vulnerable: refugees and their families might be persecuted in their home countries, so avoiding their re-identification is a critical matter. Moreover, mass media and social media affect our society and integration itself, since a negative tone systematically relates to lower acceptance rates of asylum practices [102], so extreme care has to be taken in publishing results. Nevertheless, migration studies can have a significant impact in improving our society and helping the inclusion process of migrants; thus, encouraging data sharing is one of our main goals for achieving public good.

For all these reasons, it is essential that legal requirements and constraints are complemented by a solid understanding of ethical and legal views and values such as privacy and data protection, composing an actual ethical and legal framework. To this end, a number of infrastructural, organizational and methodological principles have been developed by the SoBigData Project, in order to establish a Responsible Research Infrastructure, allowing users to make full use of the functionalities and capabilities that big data can offer to help us solve our problems, while at the same time allowing them to respect fundamental rights and accommodate shared values, such as privacy, security, safety, fairness, equality, human dignity and autonomy [66]. In particular, we strongly rely on Value Sensitive Design and Privacy-by-Design methodologies, in order to develop privacy-enhancing technologies, privacy-aware social data mining processes and privacy risk assessment methodologies. These methods are developed mainly in the fields of mobility data (such as GPS trajectories), mobile and retail data, which are some of the (unconventional) big data used in our migration studies. Moreover, some other general tools have been implemented to assist researchers in their activities, create a new class of responsible data scientists and inform the data subjects and the society about our work and our goals, such as an online course, ethics briefs and public information documents.