1 Introduction

The contemporary era has been marked by at least three unparalleled trends. The first is the unprecedented mobilities of people (Urry, 2016); the second is unprecedented migration (United Nations, 2019); and the third, which forms the focus of this chapter, is the never-before-seen volume of data and information that can be used to track internal and international migration (Haan, 2019). This includes traditional data from censuses and national surveys, as well as administrative records and other new data sources. All three trends have sparked academic and policy debate, as well as agreement on the need to use data not only to track movement, but also to generate evidence-based policy interventions that meet the needs of migrants and facilitate their integration into host communities.

Despite these three trends, it remains surprisingly difficult to track migration. This is, in part, due to an overall lack of awareness among researchers and policy makers of available data options and the pitfalls that come with using them. Few sources, moreover, offer practical and straightforward overviews of the decisions involved in the processing of data and the analytical gaps that are found in them. For these reasons, in this chapter, we map out options for understanding and analyzing migration through data.

In doing so, we first consider traditional Census and survey-based options from statistics agencies. We then explore the increasingly available option of using governmental administrative records. Finally, we assess other emerging big data sources that can be harvested from new media or private sector records. Along the way, we address some of the technical and methodological issues that arise from using data from different sources to study migration. These include the need for consistent definitions, limits in the scope of what can be examined, and the need for an awareness of issues around access and balancing necessity and proportionality around privacy. We also highlight the work that needs to be done to extend data created for one purpose, in the case of administrative records and big data, to be used for others, as is the case when these data are translated for research purposes.

In discussing these issues, we focus on Canada, our home country, but we also extend our observations to other countries. We conclude by advocating for the creation of data spines across sources and working towards shared national and international standards that will truly leverage the full potential of data on migration. There has never been a time more in need of migration research, or with as much information on the topic, as now.

2 Why Focus on Canada’s Immigration Data Ecosystem?

Global migration made headlines around the world in 2015, with unprecedented numbers of people migrating into Europe as a result of the (largely Syrian) refugee crisis. The increase in the international flow of people became a focal point of attention for social scientists and policy makers. This is especially the case for immigrant-settler countries, such as Australia or Canada, where immigrants make up 28.2% and 22% of their populations, respectively (Australia Bureau of Statistics, 2017; Chavez, 2019). Increased migration has also been witnessed in non-traditional immigrant countries in Europe that opened their doors to refugees and newcomers at the peak of the refugee crisis.

For most of the 2010s, approximately 300,000 immigrants landed permanently in Canada each year, with over 500,000 temporary residents on the ground (Hussen, 2018). In the 2020s, this increased to 400,000 immigrants landing each year and about the same number of temporary residents. With these numbers, Canada has become a world leader in migration and immigration policy (Trebilcock, 2019). It also leads in terms of refugee resettlement, surpassing the United States in total numbers in 2018, and settling far more refugees (as a proportion of the population) than other countries (Radford & Connor, 2019). Canada has developed cutting-edge immigration policy and has invested heavily in tracking migration and immigrant settlement to be able to understand how newcomers fare in the country.

Canada is also a leader in terms of data collection and innovative uses of this information to study migrants, as its data landscape is broad and includes a wide range of sources. Specifically, the country’s censuses and national surveys, collected by its national statistical agency, Statistics Canada, include immigrant status variables. Many other countries capture migrants in their data, as can be seen in the case of Australia, the United Kingdom, and the United States, but Canada is distinctive in its approach to administrative data on immigrants. Unlike many European countries with population registers, such as Denmark or Sweden (Careja & Bevelander, 2018), Canada has no single and comprehensive administrative record to identify all people in its population. Instead, it has invested in processing and using a wealth of administrative data from Immigration, Refugees and Citizenship Canada (IRCC). These include immigrant landing records and temporary resident records, as well as linkage of those records to other administrative databases, such as Canada Revenue Agency tax filing records or health records. Other countries without population registries have also invested in administrative data uses, especially to study income and employment issues, as can be seen in Australia or the United States, and/or migration flows, as seen in the United States or United Kingdom, and, like Canada, aim to link administrative records to other data to better capture migration (Ernsten et al., 2018; Rogers & McNally, 2018). Canada has also invested in linking its data across Censuses, multiple administrative records at the national level, and administrative records at subnational levels. Like their counterparts in other countries, researchers in Canada also collect their own data and are increasingly turning to Big Data from online or other sources.
The goal of our chapter, however, is not to offer a comprehensive comparative scoping review of data available, nor be able to speak generally about practices globally, but rather, to share our experience with the Canadian migration data ecosystem and to highlight issues that arise within it.

Our insights and analysis are drawn from our nearly 50 years of collective experience working on migration and immigration issues. As such, we share some of the obstacles, solutions, and opportunities we see from navigating an increasingly complex data landscape. The chapter discusses issues encountered across three categories of data: (1) censuses and national surveys, (2) administrative data, and (3) emerging data sources. These groupings are used by the Migration Data Portal (2019), which was developed by the International Organization for Migration’s (IOM’s) Global Migration Data Analysis Centre, and offer a useful framework for our discussion.

3 Canada’s Data Ecosystem

3.1 Censuses and National Surveys

Probably the most widely used data source for studying migration and immigration is the census. It is an official count, or survey, of a population that aims to describe it, and is usually conducted by countries every 5 or 10 years (Ruotsalainen, 2011). In Canada, data are collected through a combination of mailout surveys, face-to-face enumerations, and online surveys. Since 2006, rather than entering income data manually (which is prone to error or crude approximations), tax records have been used to provide economic information on individuals.

The Canadian census has two components: the short form, which in 2021 consisted of 17 core questions asked of all participants, and the long form, which in 2021 had an extended set of questions sent to 25% of households (Statistics Canada, 2021a, b). With respect to migration, the long form of the census has information on mobility within the country over the last 5 years. It also captures mother tongue. Through census linkage to administrative records, it also captures important factors, such as immigrant year of landing, admission category, and the source country of immigrants. These were previously captured in the long form. Another strength of the census is that it also collects information on an individual’s other demographic, social, and economic characteristics. For these reasons, the census has long been a dominant source of information about Canadian immigrants. It has been used to study ethnic origins of immigrants (Boyd, 1999), earning disparities between immigrants and native-born citizens (Li, 2000), and home ownership (Haan, 2007), to name but a few examples. With the addition of immigrant admission category since the 2016 census, it is likely to become even more widely used in the future for research on immigrant settlement. These categories offer insight into the pathways immigrants used to land in Canada and, for example, offer opportunities to study differences between economic immigrants and those arriving through family or refugee intake categories.

The census is also heavily used in other immigrant-settler countries, such as Australia and the United States. In fact, it is used by 149 countries around the world. Almost all (87%) collect information on country of birth; however, about a quarter lack detailed information on citizenship (Migration Data Portal, 2019). The most prominent case where citizenship information is not collected is the United States, as seen through the controversy sparked by the Trump administration’s attempt to include it in the country’s 2020 Census (Mervis, 2019). Even fewer countries (roughly 50%) collect information on immigrant period of arrival (Migration Data Portal, 2019). Countries are also limited in terms of how many questions are asked in their census.

Despite this obstacle, there have been some innovative analytical strategies used to look at immigrants longitudinally, such as the ‘double cohort method’ (see Myers, 1999; Myers & Lee, 1998).

For countries where censuses are an option, there are at least three considerations when using the data. The first is that the census is a cross-sectional dataset. This means that one cannot directly follow the same immigrants across time. As a result, any changes observed cannot definitively be linked to a cause in a previous period (Borjas, 1993). Failure to recognize this can lead to erroneous conclusions, such as the “cross-sectional integration fallacy” identified by Hum and Simpson (2004). They observed the fallacy with respect to immigrant earnings; however, the mechanisms they identify apply to any causal inference made. The problem is that a quasi-cohort is created: people from a given group, say immigrants of a given age in one census, are compared with those who would theoretically be the same people, now older, in a later census. Because there is no one-to-one match, the comparison may fail to account for new immigrants or migrants who fit the profile of the later cohort, or for other factors that make the groups different.
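The quasi-cohort problem can be illustrated with a minimal sketch. All figures are invented for illustration and are not drawn from any census, and the function names are hypothetical. The sketch assumes that lower earners in an arrival cohort leave the country between two censuses, and compares the change that two separate cross-sections would show with the change among people matched one-to-one.

```python
from statistics import mean

# Hypothetical earnings (in $1,000s) for one arrival cohort.
# Each tuple: (person_id, earnings at census 1, earnings at census 2 or None if the person left).
cohort = [
    ("a", 20, None),   # left before census 2
    ("b", 25, None),   # left before census 2
    ("c", 40, 44),
    ("d", 50, 55),
]

def cross_sectional_change(people):
    """Change in the cohort mean as two separate censuses would show it."""
    mean_1 = mean(e1 for _, e1, _ in people)
    mean_2 = mean(e2 for _, _, e2 in people if e2 is not None)
    return mean_2 - mean_1

def matched_change(people):
    """Mean change among people observed in both censuses (one-to-one match)."""
    return mean(e2 - e1 for _, e1, e2 in people if e2 is not None)

apparent = cross_sectional_change(cohort)
actual = matched_change(cohort)
print(apparent, actual)
```

In this invented example the cross-sectional comparison suggests much larger earnings growth than the matched individuals actually experienced, purely because the composition of the group changed.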

At least for Canada, a second problem occurs when focusing on issues of internal migration. The Canadian Census contains questions only on place of residence one and five years ago. This means it cannot capture migration that happened earlier than that period, nor can it capture the timing of moves that occur within the period. As a result, repeat migration within the period is missed. Several researchers have commented on problems associated with this, and a good review of them can be found in Aydemir and Robinson (2006).
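A small sketch can make the loss of information concrete. The residence history below is hypothetical, as are the function names; the sketch simply contrasts the moves a person actually made with what the one-year and five-year residence questions would reveal.

```python
# Hypothetical residence history: province of residence in each year.
history = {2016: "NS", 2017: "ON", 2018: "AB", 2019: "AB", 2020: "ON", 2021: "ON"}

def moves_in_history(hist):
    """Count every change of residence between consecutive years."""
    years = sorted(hist)
    return sum(1 for prev, cur in zip(years, years[1:]) if hist[prev] != hist[cur])

def moves_seen_by_census(hist, census_year):
    """Origin and destination pairs visible from the one- and five-year questions only."""
    return {lag: (hist[census_year - lag], hist[census_year]) for lag in (1, 5)}

actual_moves = moves_in_history(history)
observed = moves_seen_by_census(history, 2021)
print(actual_moves)
print(observed)
```

Here the person moved three times, but the census questions would record a single Nova Scotia to Ontario move over five years and no move in the last year; the stay in Alberta and the timing of each move are invisible.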

Census definitions also change over time, both in terms of specific questions and spatial units, which presents a third set of issues to wrestle with. Take, for example, the questions around ethnicity, visible minority, or occupation, which have all seen changes over the last 50 years. With respect to ethnicity, participants were discouraged from answering ‘Canadian’ for several years and the construction of ethnicity as a single versus multiple origin also changed (see Boyd, 1999). A direct measure of visible minority that allows for the capture of race only emerged in 1996, in response to the need to measure it under the Canadian Multiculturalism Act and the adoption of Employment Equity (see Boyd et al., 2000). The measurement of occupation has also changed across censuses, with the Standard Occupational Classification used before 2001 being replaced by the National Occupational Classification system, used from 2001 onward, and updated in 2006, 2011, and 2016.

In terms of spatial analysis, Census Tracts (CT), Dissemination Areas, and other geographic units also change over time. Such revisions are made due to new road construction, neighborhood growth, population growth within the CT or other units, and community development. In most cases, a CT, or other unit, is split into multiple units over time, requiring researchers to recreate the original boundaries by aggregating the data if they want to study changes between multiple Censuses. In other cases, boundary revisions occur in ways that make the statistical ‘reconstruction’ of the original geographical boundaries laborious or even impossible (Kaida et al., 2020). To address some of the challenges, the Canadian Longitudinal Census Tract Database has been developed to study neighborhood changes at the CT level using the 1971–2016 Canadian Censuses (see Allen & Taylor, 2018 for more details). Taken together, these issues also make longitudinal analysis more difficult and require a fairly high level of technical sophistication.
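For the simple case in which a tract is cleanly split, recreating the original boundary amounts to summing the later units back together. The sketch below assumes a hypothetical crosswalk and invented counts; the tract identifiers and the function name are illustrative only and do not reproduce the method of any particular database.

```python
from collections import defaultdict

# Hypothetical crosswalk: later-census tract ID -> original tract ID.
# Tract "0001.00" was split into "0001.01" and "0001.02"; "0002.00" is unchanged.
crosswalk = {"0001.01": "0001.00", "0001.02": "0001.00", "0002.00": "0002.00"}

# Hypothetical counts of immigrants per later-census tract.
later_counts = {"0001.01": 120, "0001.02": 80, "0002.00": 50}

def aggregate_to_original(counts, xwalk):
    """Sum counts from split tracts back into their original boundaries."""
    totals = defaultdict(int)
    for tract, count in counts.items():
        totals[xwalk[tract]] += count
    return dict(totals)

harmonized = aggregate_to_original(later_counts, crosswalk)
print(harmonized)
```

This only handles clean splits. Where boundaries are redrawn so that a later tract overlaps several original tracts, counts would have to be apportioned with weights, which is where reconstruction becomes laborious.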

Methods for capturing information also change over time, which is a fourth problem in working with Census data. With respect to migration, immigration, and immigrant integration, two recent methodological changes affect how comparable data are over time. The first is the linking of administrative landing records to the census, which we will discuss below, to capture micro-categories of immigrants. While past censuses asked respondents to self-identify as immigrants, since 2016, immigrant variables are derived from landing records (Statistics Canada, 2016). The second is that administrative tax records have been used in place of self-reported income since the 2006 Census. Using census data over time was also complicated by the use of a National Household Survey in place of a census in 2011. During that year, participation was voluntary, leading some to argue that the data were worthless because of response bias (Hulchanski, 2014). It has since been discontinued as a survey and Censuses are again used. Nevertheless, this means that methods have changed through time, which makes comparison inconsistent over long periods.

Another set of options for exploring issues of migration, immigration, and immigrant settlement is looking at other surveys collected by national statistics agencies. In the Canadian case, these include the Ethnic Diversity Survey (EDS), the General Social Surveys (GSS), the Labour Force Survey (LFS), and the Canadian Community Health Survey (CCHS). In Canada, the GSS program was established in 1985 and is a series of independent, annual, voluntary, cross-sectional surveys, each covering one topic in-depth (Statistics Canada, 2019a). At the time of writing there are 35 cycles of the GSS. These surveys cover a wide range of topics including civic participation (Fong & Shen, 2016; Wong & Tézli, 2013), sense of belonging to Canada (Hou et al., 2018; Wong & Tézli, 2013) as well as discrimination and health (Nakhaie & Wijesingha, 2015), among other issues. The cycles focusing on identity are particularly useful for those studying immigration, as they tend to contain similar identity questions as those in censuses but offer more detail and have a wider range of variables to compare against. Unlike censuses, they tend to capture a sample of the population. Until 1998, the target sample size was 10,000; it was increased to 25,000 in 1999, reduced to 22,000 in 2015, and further reduced to 20,000 from 2016 onward (ibid). Though these samples are rather large, their shrinking size does pose challenges for statistical power, especially when immigrant subgroups become the focus.

The EDS, by contrast, aimed to dig into the ethnic, racial and immigrant experience in Canada and to explore it through economic, political, and social and cultural spheres. It has questions on year of immigrant arrival and immigrant status and measures across these dimensions. It was conducted just once, in 2002, and had a sample of 42,000 people. The survey has been used to look at a wide range of issues, such as economic performance and the link to ethnic ties (Li, 2008), perceptions and experience of discrimination (Reitz & Bannerjee, 2007), and religion and integration (Reitz et al., 2009), to offer but a few examples. Although an old survey, it has been linked to more recent administrative data, which we outline below. Such linkage shows how a data spine approach, where administrative data act as the spine for other data to connect to, can preserve the life-span of older cross-sectional surveys.

The LFS is another option. It is designed to capture employment and unemployment trends and is conducted monthly with the aim of offering information regarding job creation, education and training, as well as income supports and pensions (Statistics Canada, 2019b). The LFS is quasi-longitudinal, in that people remain in a rotating panel for 6 months, but it suffers from fairly small sample sizes. Yet another option for studying migration in Canada is the Canadian Community Health Survey (CCHS). The CCHS was originally designed to get a better sense of the health statuses of Canadians, but its relatively large sample size (of 65,000 participants each year since 2007, down from 130,000 from 2001 until then), its regular collection schedule of every 2 years, and its ability to look at both regular and special topics within health (Statistics Canada, 2022) make it a popular option among researchers. As it pertains to migration, however, the CCHS has many of the same problems as the other cross-sectional surveys listed here.

Household surveys conducted by national statistics agencies are also used in many other countries (Migration Data Portal, 2019). Portugal and Ireland, for instance, rely on household surveys alone to track migration within their borders (de Beer et al., 2010). Countries such as Australia or the United States have similar general surveys, as well as ones that focus on race or ethnicity. The United Kingdom has a Labour Force Survey that can capture those born outside the country, and similar data are collected from the European Union Labour Force Survey carried out in 28 countries (Del Fava et al., 2019); many countries around the world conduct similar types of surveys.

Although GSS and other household surveys offer much potential to examine a wide range of topics, their sample size presents obstacles. This has a profound effect on the study of migration and immigrants in secondary or rural regions (see Ramos & Yoshida, 2011; Yoshida & Ramos, 2012; or Yoshida & Ramos, 2015). This is because the random sampling, and even the clustering strategies used by statistical agencies, often means that very few people from these regions are captured. A related problem is that the power of the models, and sophistication of analysis that can be done, is very limited because of the small sample. As a consequence, the areas that often most need analysis are missed, or are subject to very basic engagement through descriptive statistics. These issues add to many of the same obstacles faced with censuses.
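The effect of a small subsample on precision can be shown with a rough calculation. The sketch below uses the textbook margin of error for a proportion and assumes simple random sampling, which understates the uncertainty of clustered survey designs. The full sample of 20,000 is the GSS target cited above; the subgroup of 100 respondents is a hypothetical figure for immigrants in one secondary region.

```python
import math

def margin_of_error(p, n, z=1.96):
    """Approximate 95% margin of error for a proportion under simple random sampling."""
    return z * math.sqrt(p * (1 - p) / n)

full_sample = 20_000   # GSS target sample size cited in the text
subgroup_n = 100       # hypothetical: immigrants captured in one secondary region

# p = 0.5 gives the widest (most conservative) margin.
moe_full = margin_of_error(0.5, full_sample)
moe_subgroup = margin_of_error(0.5, subgroup_n)
print(round(moe_full, 3), round(moe_subgroup, 3))
```

Under these assumptions an estimate for the whole sample is accurate to within about one percentage point, while the same estimate for the hypothetical regional subgroup is accurate only to within about ten, which is why analysis in such regions is often limited to basic descriptive statistics.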

Longitudinal surveys are yet another option available to researchers. They are useful because they contain repeated observations from the same individuals, allowing researchers to assess the longer-term migratory behaviours of people. Methodologically, this is advantageous because it allows researchers to compare the variation within individuals to variation across individuals. With such data, researchers can assess how unique a particular individual is relative to their peer group, immigration cohort, visible minority group, and so forth.
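The distinction between variation within and across individuals can be sketched with a simple variance decomposition. The panel below is invented for illustration and the function names are hypothetical; the sketch is a descriptive summary, not a substitute for a fixed-effects or multilevel model.

```python
from statistics import mean, pvariance

# Hypothetical panel: annual earnings (in $1,000s) for three people over three waves.
panel = {
    "p1": [30, 32, 34],
    "p2": [50, 50, 50],
    "p3": [40, 45, 50],
}

def between_variance(data):
    """Variance of person-level means: how much individuals differ from one another."""
    return pvariance([mean(values) for values in data.values()])

def within_variance(data):
    """Average variance around each person's own mean: change within individuals."""
    return mean(pvariance(values) for values in data.values())

between = between_variance(panel)
within = within_variance(panel)
print(round(between, 2), round(within, 2))
```

A cross-sectional survey can only speak to the first quantity; repeated observations on the same people are what make the second one measurable.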

In the Canadian context, the Longitudinal Survey of Immigrants to Canada (LSIC), which follows immigrants who landed in the country between October 1, 2000, and September 30, 2001, is a somewhat dated but useful dataset (Statistics Canada, 2007). Like the EDS, it is linked to the data spine of more recent administrative data, which we discuss below. The strength of LSIC is that it contains comprehensive information about immigrants, tracking them 6 months after arrival (Wave 1) and then again at 2 years (Wave 2) and 4 years after arrival (Wave 3). This overcomes the problems faced with Census panels or cross-sectional surveys by providing a one-to-one match of migrants and immigrants over time. It also allows researchers to examine a more detailed level of immigration categories compared to censuses before 2016 (the first year that admission category was included). The survey offers researchers the ability to assess education, racial and ethnic diversity, integration, labor market outcomes and population demography. It has been used creatively by researchers to examine the intersection of a number of immigrant experiences. For example, some have looked at immigrant language proficiency, gender and health (Pottie et al., 2008), while others have used it to analyze the experience of family pathway immigrants (VanderPlaat et al., 2013) and youth, as well as migration (Houle, 2007; Newbold, 2007; Yoshida & Ramos, 2013), among other uses.

Fewer countries have conducted longitudinal surveys of immigrants. Australia’s Longitudinal Survey of Immigrants to Australia (LSIA) and New Zealand’s Longitudinal Immigration Survey (LisNZ) are two comparable examples to the LSIC. The last cohort of immigrants captured in the LSIA was in 2004/2005 (Australia Bureau of Statistics, 2011) and the last cohort of the LisNZ was surveyed in 2009 (Stats NZ, 2019). Germany’s Socioeconomic Panel (SOEP), the Dutch LISS immigrant panel and the UK Household Longitudinal Survey are other examples that have longitudinal data on immigrants.

The LSIC, like other surveys, faces obstacles related to definition changes across waves as well as small sample size when it comes to studying recent immigrants in secondary regions. A more serious issue is that the survey, like other longitudinal surveys, suffered from high rates of attrition. The final sample across all waves of the LSIC was just under 8000 observations, down from the initial sample of 20,300 in Wave 1 (Statistics Canada, 2007). Careja and Bevelander (2018) note that the problem is particularly striking for immigrant populations, which have higher rates of attrition than other populations. Their review of longitudinal surveys found that about a third of immigrants, sometimes over half, leave panel surveys across countries.
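The scale of the loss follows directly from the two figures reported above. In the short calculation below, 8,000 is used as an upper bound because the final sample is described only as "just under 8000", so the resulting rates are bounds rather than exact values.

```python
initial_sample = 20_300   # Wave 1 figure reported in the text
final_sample = 8_000      # upper bound: the text says "just under 8000"

retention = final_sample / initial_sample   # share still present across all waves
attrition = 1 - retention                   # share lost
print(f"retention at most {retention:.1%}, attrition at least {attrition:.1%}")
```

On these figures, roughly three in five of the initial sample were not present across all waves.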

Small samples in secondary regions and high rates of attrition mean that longitudinal surveys are very costly to run and are, therefore, often among the first datasets to be cut in times of austerity. In Canada, nearly every longitudinal survey that was once conducted by Statistics Canada has been cut. Thankfully, these surveys were cut at the same time as administrative data became more readily available.

3.2 Administrative Records

Administrative records are a rich and largely untapped source of data that policy makers and academic researchers can use to track migration. This is especially the case in countries that lack national registries. The decision to move to using administrative data in Canada was taken, in part, because of the elimination of the mandatory long-form Census in 2011, which forced researchers and policy makers to look for alternatives. A positive consequence of the move was the creation of data linked to administrative records, which has turned those records into a data spine for other datasets in the statistical ecosystem. National registries play that role in other countries. We believe that administrative data can act as valuable spines to link other data. Linkage to administrative data is an approach that increases the power of older surveys, cross-sectional data, or small datasets created by researchers, communities, NGOs, or industry.

For the study of migration and immigrant settlement, the Permanent Resident Landing File (PRLF) is one such administrative option for creating a data spine. Every landed immigrant to Canada has a landing record, often completed by immigration officers. These administrative data allow the Canadian government to collect and maintain information on newcomers to the country. The file is large (it captures all newcomers since 1980), detailed (languages spoken, citizenship, education at landing, intended destination, size of immigrating unit, and admission category are only some of the variables on the file), and widely used for learning more about the country’s newest residents. There are millions of unique records in the file, allowing for a detailed assessment of how Canada’s efforts to recruit immigrants across different pathways have evolved over time. The disadvantage of the PRLF is that it only has information on immigrants at the time of landing, so it is not possible to learn about how immigrants are doing in Canada after that point without linking the data to other files, such as taxfiler data. This is possible because every newcomer has a unique identifier, which allows methodologists to find them in other datasets and link the files together.
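The logic of linking on a unique identifier can be sketched in a few lines. The records, field names, and function below are all hypothetical and greatly simplified; they do not reproduce the structure of the PRLF or of any tax file, and real linkages involve far more care around identifiers and privacy.

```python
# Hypothetical landing records keyed by a unique person identifier.
landing = {
    101: {"landing_year": 2015, "admission_category": "economic"},
    102: {"landing_year": 2016, "admission_category": "family"},
    103: {"landing_year": 2016, "admission_category": "refugee"},
}

# Hypothetical annual tax records: possibly many rows per person (one-to-many).
tax_rows = [
    {"person_id": 101, "tax_year": 2016, "earnings": 30_000},
    {"person_id": 101, "tax_year": 2017, "earnings": 34_000},
    {"person_id": 102, "tax_year": 2017, "earnings": 28_000},
]

def link(landing_records, tax_records):
    """Attach landing characteristics to each tax row; list people with no tax rows."""
    linked = [
        {**row, **landing_records[row["person_id"]]}
        for row in tax_records
        if row["person_id"] in landing_records
    ]
    matched_ids = {row["person_id"] for row in linked}
    unlinked = sorted(set(landing_records) - matched_ids)
    return linked, unlinked

linked, unlinked = link(landing, tax_rows)
print(len(linked), unlinked)
```

Keeping track of the unlinked identifiers matters in practice, since people who never appear in the second file (here person 103) drop out of any analysis built on the linked data.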

Probably the best Canadian example of a successful linkage of the PRLF is the Longitudinal Immigration Database (IMDB). Linked to the T1 Family File (the main tax returns that taxfiling units submit to the Canada Revenue Agency on an annual basis), the IMDB contains all of the fields in the PRLF, as well as nearly every field that the Canada Revenue Agency requires taxfilers to submit. As with the PRLF, every immigrant who has landed in Canada since 1980 is on the file, allowing for an analysis of economic outcomes from then onward. Since individuals are taxed differently if they are married or have children, the IMDB also enables researchers to look at the composition of tax filing units. The data have been used to study employment and earnings outcomes (Hou & Bonikowska, 2016; Kaida et al., 2019; Warman et al., 2015), inter-provincial mobility and retention of immigrants (Haan et al., 2017), migration and immigration in secondary regions (Ramos & Bennett, 2019; Yoshida & Ramos, 2017), as well as analysis looking at the range of immigrant and refugee pathways not available in other datasets (Kaida et al., 2019; Yoshida et al., 2016).

A number of countries lacking population registries have also considered using administrative data to explore migration issues. The Australia Bureau of Statistics, for example, has linked data on net overseas migrations with visa grants information administered by the country’s then Department of Immigration and Citizenship (Temple & McDonald, 2018). The United States has assessed how Department of Homeland Security and United States State Department records can be used to study immigration (Grieco & Rytina, 2011) and has looked into how census data can be linked with Internal Revenue Service records (Akee & Jones, 2019). Researchers in the United Kingdom have also assessed how health data from the National Health Service can be linked to census-based longitudinal studies (Ernsten et al., 2018). Further, the United Kingdom aims to make administrative data the core of its immigrant and migration data infrastructure, while linking it to censuses and surveys (Rogers & McNally, 2018). Countries and researchers are increasingly moving toward administrative data to examine immigration and migration issues. Despite this trend, the Canadian case sheds light upon a number of obstacles with using administrative records.

Gaining access to administrative data is one of the biggest barriers researchers face in working with this form of information. Accessing the PRLF and the IMDB, in Canada, for instance, is especially difficult for those outside the Federal civil service or who are not affiliated with a Canadian university. Recently, Statistics Canada has created an interactive IMDB portal (see Statistics Canada, 2019c), but this does not allow researchers to access the microdata or conduct their own research.

Although access has improved in recent years, with the IMDB now housed in Research Data Centres located at most major universities across Canada, researchers still need to go through a security and screening process before gaining access, and are subject to vetting rules before data can be released. This slows the research process and limits who can use the data. The Research Data Centre approach is also used by other countries, such as Germany, which has centres across the country (Bender et al., 2014). Because the data are complex and can identify individual immigrants, such protocols are not unreasonable.

Another obstacle in working with administrative data is that researchers and policy makers can only look at the issues that the data capture. More specifically, in the case of Canada’s IMDB, this means a focus on economic integration and mobility, at the expense of non-economic issues (Costigan et al., 2016). Another drawback is that the focus of the IMDB is on Principal Applicants, the immigrants who drive the application to settle in Canada, with sparse linkage to the family that may come with them. Although the database does have family records on the taxfiler side of the data, issues still arise in distinguishing between the landing family, that is, the family unit at the time of landing, and the perpetual family, that is, the family as it evolves over time (Ramos & Bennett, 2019). Recent versions of the database have included a new family marker; however, it is still early days in looking beyond the individual as the unit of analysis. The database is also being linked to a number of existing surveys and other databases, like the GSS, the CCHS, or LSIC, through partnership with IRCC.

At the same time as researchers are beginning to discover the IMDB, Statistics Canada has moved towards making even more impressive data environments through its Secure Data Linkage Environment (SDLE). One such environment is the Canadian Employer Employee Dynamics Database (CEEDD). The CEEDD contains IMDB records and a long list of other administrative files. Some of these files include corporate tax return and owner files, records of employment, employer-issued earnings statements, and exporter/importer information.

Two additional obstacles with administrative data are their size and complexity. In the case of Canada’s IMDB files, when linked together, they exceed 30 gigabytes, which creates software and processing issues around analysis. This means using the dataset requires very advanced data processing and analytic skills, such as linking many different files, one-to-many merges, many-to-many merges, and the extensive use of lag variables. As such, it is not feasible for intermediate or non-specialist researchers. These are likely to be some of the longer-term issues surrounding the analysis of IMDB and many other administrative files for years to come.
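To give a sense of what working with lag variables involves, the sketch below builds a one-year lag of province of residence on a small, invented person-year file. The records and field names are hypothetical and do not correspond to IMDB variables. The lag is looked up by person and calendar year rather than by row position, so that a missing year is not mistaken for the previous year and one person's record is never used as another person's lag.

```python
# Hypothetical person-year records, e.g. after linking landing and tax files.
rows = [
    {"person_id": 1, "year": 2016, "province": "NS"},
    {"person_id": 1, "year": 2017, "province": "NS"},
    {"person_id": 1, "year": 2018, "province": "ON"},
    {"person_id": 2, "year": 2017, "province": "MB"},
    {"person_id": 2, "year": 2019, "province": "MB"},  # 2018 record missing
]

def add_lagged_province(records):
    """Add the previous year's province, leaving None when that year is absent."""
    by_key = {(r["person_id"], r["year"]): r["province"] for r in records}
    out = []
    for r in sorted(records, key=lambda r: (r["person_id"], r["year"])):
        lag = by_key.get((r["person_id"], r["year"] - 1))
        moved = None if lag is None else lag != r["province"]
        out.append({**r, "province_lag1": lag, "moved": moved})
    return out

lagged = add_lagged_province(rows)
for record in lagged:
    print(record)
```

Even in this toy case, the first observed year for each person and the year after a gap have no usable lag, and decisions about how to treat such records multiply quickly across files with millions of rows.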

Another source of administrative data that migration and immigration researchers can explore, and one that is also largely untapped, is sub-national administrative records. In Canada, these tend to come from provinces and territories. To date, some work has been done to explore how provincial health records can be used to assess immigrant retention and mobility within a province, as seen with recent work done in the provinces of Manitoba (Fransoo, In progress), New Brunswick (McDonald et al., 2018), and Ontario (Vigod et al., 2019). Most other provinces are also exploring how health data can be better harnessed to understand the experiences of their populations, including immigrants. Many provinces have also looked into how criminal records can be used. To fully maximize the use of sub-national records, it is important to be able to link them to the PRLF or other national records, such as those on taxfilers, as done in the IMDB, to gain a full longitudinal portrait of immigrants. Key to being able to do that is creating common data protocols and standards. The greater the linkage to a common base, the stronger and more comprehensive the ‘data spine,’ and the more complete a portrait of experience researchers and policy makers will get. These linkages have been done in British Columbia, Manitoba, Ontario, and New Brunswick, at the time of writing.

Similar to the situation with the PRLF and IMDB, accessing sub-national records requires security and sensitivity around privacy. Currently, the Canadian Institute of Health Research helps coordinate health data and acts as a hub for health data linkage. However, this has largely been done for epidemiological purposes, and considerations of how health data can be used to study migration and immigrant settlement have created many ethical and policy debates (McDonald et al., 2019). It has also led to much discussion on the best avenue for researchers to access the data and to work on procedures for access and vetting. Additionally, the use of health data for studying migration and immigration may require, in some jurisdictions, reform of privacy laws. Most provinces in Canada have separate legislation, in addition to regular privacy acts, to protect the health information of individuals. For example, researchers in the Province of New Brunswick needed to create a ‘Research Act’ to allow health records to be used for purposes other than their original intention. This is an issue that also affects other forms of new data derived from social media, apps, or information gathered from smartphones, which we discuss below. Linkage across administrative records also requires working out authority over and ownership of data, given that it links provincial records to federal records, and national and sub-national policies, practices, protocols and standards may not always align. Linkage of educational and other sub-national records is still in its early days and few provinces have yet developed means for linking those data to other sources, but we expect to see significant progress in these areas in the next 5 years, and we expect this work to continue to build on the national administrative data that can act as spines for other data across the country.

Another issue with sub-national administrative data is the relative lack of development of some of these sources. They are largely raw records that were not created for the purpose of research and, thus, require significant cleaning and coordination before linkage to national data or cross-province analysis can be done. This creates obstacles for sub-national governments, especially in smaller and secondary regions, where civil servants often do not have the mandate, skills, or time to turn administrative data into a research-ready platform. Although this is also true of national sources, the issue appears to be more prevalent at the lower levels of jurisdiction. This is also a problem encountered in other countries. Researchers and policy-makers should, therefore, expect challenges when working with sub-national administrative data.

An area that has largely been unexamined, in terms of migration and immigrant settlement data, is information from municipalities, which are yet another level of sub-national data. As researchers and policy makers consider municipal records, one of the biggest obstacles is that, like other sub-national governments, most municipalities do not have the financial or human resources to process their administrative data. They in turn face challenges in linking it to provincial and national level data. Such data, like other administrative data, were not created for the purpose of studying migration and, as a result, also face obstacles in terms of access, shared standards, definitions, units of analysis, and adequate documentation that need to be addressed if such data are to be used for meaningful analysis. Despite these obstacles, according to the United Nations (2018), over 55% of people worldwide live in cities, with major centers such as Toronto, London, New York or Berlin becoming global cities (Sassen, 2016), largely due to the flow of migration and immigrant settlement (Sanderson et al., 2015). Even smaller, secondary cities are increasingly linked to the world through immigration (Haan & Prokopenko, 2016). For these reasons, there is much opportunity in exploring municipal administrative records. It is on this front that the academic research community has much to offer, especially those working in computer science and the social sciences.

3.3 Other Data Options

Researchers also have much to offer in terms of collecting their own surveys, as has been traditionally done, as well as using new sources of information to understand migration and immigrant settlement, such as mobile phone data or administrative records from service providers or other NGOs. These are far too numerous to fully enumerate in this chapter, and our goal is not to offer a scoping review, but rather to discuss how they fit in the data ecosystem and the issues that are faced in different corners of it. So, here we offer some insight on additional options, rather than going into full detail on any specific data sources. We focus on surveys generated by the academic research community, Big Data, and NGO data.

Original surveys driven by researchers play an important role in filling gaps left by those collected by national statistical agencies. They can explore topics that are not covered by agency surveys and administrative data and are nimble in terms of development and delivery. One of the key obstacles with such work, however, is an inconsistency in how basic units of analysis, such as immigrant or refugee, as well as other parameters, are defined across studies (de Beer et al., 2010; Pritchard et al., 2019). This, in turn, makes comparison across studies difficult and, more importantly, as with inconsistency in government data sources, it makes longitudinal comparison difficult and makes linking to national or international data sources next to impossible. Here again, investing in common data standards and protocols helps leverage the data. Another obstacle with such initiatives is that many researcher-led surveys have small samples due to cost and other constraints, which weakens their statistical power. It is, thus, important for researchers to consider how they can use questions comparable to those of national and international instruments and how their surveys can be linked to them through common geographies or other units. If they do so, they can tap into larger samples and generalize results beyond the confines of their own data.
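
One simple way to see why small samples limit what can be concluded is the standard error of an estimated proportion under simple random sampling, SE = sqrt(p(1 − p)/n). The Python sketch below uses this textbook formula with hypothetical sample sizes. It is not drawn from any of the surveys cited, and it ignores the design effects that real surveys would need to account for.

```python
import math


def margin_of_error(p, n, z=1.96):
    """Approximate 95% margin of error for a proportion under simple random sampling."""
    return z * math.sqrt(p * (1 - p) / n)


# Hypothetical sample sizes: a small researcher-led survey vs. a pooled sample.
small = margin_of_error(0.5, 400)
pooled = margin_of_error(0.5, 10_000)
```

With these assumed values, the margin of error falls from roughly 4.9 percentage points to roughly 1 percentage point, which illustrates the gain available when a small survey can be pooled with, or linked to, a larger instrument.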

Policy makers and researchers are also turning to Big Data, or information that can be gathered from social media, apps, smartphones, and other technology (Jünger, 2019; Keusch et al., 2019). Such data have the potential to offer real-time analysis of migrants and to support a wide range of analyses. At the same time, such apps or technology can offer services to those who use them, such as hosting information or offering translation. Like administrative records, such harvested data will need to be seen in light of the original purpose of collection versus the use for studying migration and immigrant settlement. For example, an app that hosts immigrant settlement information may capture geo-location data to help provide relevant information to the user. Such data can also be used to study mobility patterns or be linked to other geo-spatial data, extending uses beyond the original intent.

To fully maximize such information, issues over data standards, definitions and protocols are important. They are all key to linking across data sources. Yet another consideration around using such data is ethics and how their use complies with privacy laws in different jurisdictions (Scassa, 2019). Concerns, for instance, are already being raised in the European Union over governments using smartphone data to identify undocumented migrants and using the information as a weapon to deport refugees (Meaker, 2018). As with all data, in the wrong hands, such data could be used to the detriment of those most vulnerable. For this reason, national governments have an important role to play in creating common data practices and protocols that weigh the necessity for gathering private information and the proportionality in protecting privacy.

Settlement organizations and other NGOs also have access to information on those using their organization’s services or their members. They also have data they must collect for their funders, as governments and other funders have demanded transparency and accountability of organizations. Such data have already been used by Immigration, Refugees and Citizenship Canada, through its Immigration Contribution Agreement Reporting Environment (iCARE), which accesses the data to evaluate and assess the settlement of Syrian refugees to the country (e.g. IRCC, 2019). These data have not yet been used extensively in Canada for research, but this will soon change now that iCARE has been linked to the IMDB. In addition to these data, organizations often collect their own information on the programs and services used by their clients, and those data could be an important tool for researchers. However, as with other sources, such data suffer from a lack of consistency in how key concepts and units of analysis are defined, and most organizations lack the financial and human resources needed to process this information. This is an area where academic researchers can play a significant role alongside national governments and statistical agencies in helping coordinate the broader national and international ecosystems.

4 Lessons Learned on the Obstacles to Overcome and Opportunities to Pursue

Across each data source there is a common set of issues that can and need to be addressed by the policy and research community. They are issues that affect all data landscapes, not just the Canadian immigration data ecosystem. A key obstacle is the need to create common data standards that afford consistent measures that have common definitions (de Beer et al., 2010). As noted above, definitions of key terms, units of analysis, and geographies commonly change over time. Additionally, different datasets have their own definitions for core concepts like migrant, immigrant, refugee and so forth. A scoping review of literature showed that, in research on child and youth refugees, there were over 200 different groupings of age to define children and/or youth (Pritchard et al., 2019). Such inconsistencies limit the ability to link data and to offer robust comparative analysis, without requiring a considerable amount of background work prior to analysis.
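
One piece of the background work described above is recoding study-specific categories to a shared standard. The following Python sketch shows the idea for age groupings. The study names, the age bands and the common standard are invented for illustration and are not taken from Pritchard et al. (2019).

```python
# Hypothetical common standard: (label, lowest age, highest age), inclusive.
COMMON_BANDS = [("child", 0, 14), ("youth", 15, 24), ("adult", 25, 120)]

# Hypothetical age bands that three different studies each label "youth".
STUDY_BANDS = {
    "study_a": (15, 24),
    "study_b": (12, 18),
    "study_c": (16, 29),
}


def common_labels(low, high):
    """Return every common-standard label that a study's age band overlaps."""
    return [label for label, lo, hi in COMMON_BANDS if low <= hi and high >= lo]


def is_comparable(low, high):
    """A band can be recoded cleanly only if it falls inside one common band."""
    return len(common_labels(low, high)) == 1


report = {name: common_labels(*band) for name, band in STUDY_BANDS.items()}
```

In this toy example only study_a maps cleanly onto the common standard. The other two bands straddle two categories, so their results could be compared with other studies only if the underlying individual ages were available for recoding.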

Another obstacle to overcome is data access. This is particularly the case for confidential government microdata, administrative data and data built by individual researchers. In all cases, issues of privacy create barriers to access. Common protocols could be developed for accessing administrative data internationally, nationally and sub-nationally. For researchers collecting their own surveys or compiling their own data, using information repositories, or offering open access to replication datasets, needs to become the norm. This will enable data to be used by a wider range of users and potentially to be linked to other sources, which can help bridge the gaps they may have because of small sample size. Creating greater access to data will foster greater engagement with migration and immigration issues.

A third obstacle to overcome is responsibility for developing new data. The various aspects of data development (creating and testing questionnaires, collecting and cleaning data, generating documentation and derived variables, among other considerations) have largely been the responsibility of national statistics agencies. Increasingly, with the advent of administrative data, other federal or national departments are contributing to the data development process. This can include disclosing data or defraying the costs of development. It also means that national statistics agencies will increasingly play the role of providing technical assistance in addition to data collection. They are well positioned to generate and advocate for data standards across a data landscape. The sheer volume of potential new data sources is likely to require the hiring of many more data scientists across all levels of government. It will also require a greater engagement of academic institutions to assist with the Herculean task of shaping the new data landscape. Academic institutions can and should become more centrally involved in the data development process and help with the generation of common protocols. With greater access also comes the need for improved numeracy and development of the skills needed to use different levels and types of data, especially for those aiming to use administrative data.

Another challenge is that most datasets, especially administrative data, are created with one purpose in mind and applying them to issues of migration and immigration will stretch those purposes. To date, the focus of administrative data has been on economic issues (Costigan et al., 2016) and individuals (Ramos & Bennett, 2019). Such focus is largely because of constraints of the data, which do not measure a range of other issues. In other words, one can only answer the questions that are asked or measure things that have data collected on them. To overcome this obstacle, data linkage to censuses, surveys, and other administrative datasets has been key. We would argue that such linkage can also be extended to yet other sources, if researchers, government departments and national statistical agencies adopt a data spine model that recognizes the need for national data hubs and common practices.
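
A data spine, in this sense, can be pictured as a core list of person identifiers to which every other source is linked. The Python sketch below is one possible illustration under strong assumptions: it presumes that a shared, already anonymized identifier (spine_id) exists in every source, and the source names, fields and values are hypothetical. Real linkage environments typically rely on more elaborate deterministic or probabilistic matching.

```python
# Hypothetical spine: the core set of anonymized person identifiers.
spine = ["A1", "A2", "A3", "A4"]

# Hypothetical sources keyed by the same identifier; none covers everyone.
sources = {
    "tax": {"A1": {"income": 41000}, "A2": {"income": 38500}, "A3": {"income": 52000}},
    "health": {"A1": {"province": "NB"}, "A4": {"province": "MB"}},
    "survey": {"A2": {"sense_of_belonging": 4}},
}


def link_to_spine(spine_ids, linked_sources):
    """Build one record per spine identifier, pulling fields from each source."""
    records = {}
    for spine_id in spine_ids:
        record = {"spine_id": spine_id}
        for name, rows in linked_sources.items():
            # Prefix fields with the source name so their origin stays traceable.
            for field, value in rows.get(spine_id, {}).items():
                record[f"{name}_{field}"] = value
        records[spine_id] = record
    return records


def coverage(spine_ids, linked_sources):
    """Share of spine identifiers found in each source."""
    return {
        name: sum(spine_id in rows for spine_id in spine_ids) / len(spine_ids)
        for name, rows in linked_sources.items()
    }


linked = link_to_spine(spine, sources)
rates = coverage(spine, sources)
```

Reporting coverage alongside the linked records makes visible how much of the spine each source actually reaches, which is one way of showing the questions a given source can and cannot answer.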

5 A Call for Creating Spines and Data Standards

In place of a conclusion, we offer a call to policy makers and researchers to continue to work towards creating national data spines and common data standards and practices. We make this plea because we believe it is a huge opportunity to meet the challenges of the current migration and immigrant settlement data landscape and can lead to improved research and policy outcomes. The notion of a national data spine is to create a core data infrastructure that can be used to link across national statistics, survey and administrative records, and sub-national statistics as well as research conducted by those in the academic, NGO, and private sectors. The main way forward in this regard is for national statistical agencies to work in partnership with stakeholders across the data ecosystem so that all data can be linked to administrative and census records. This will allow smaller and more nimble surveys as well as a series of administrative records across areas to be connected. This will mean that data standards will need to be developed as well as common procedures and protocols. Those will also help navigate the ethical, privacy, and other challenges that arise from unprecedented development of and access to information (Scassa, 2019). This is underway, to some extent, in Canada through the Secure Data Linkage Environment. It is focused on creating a space where information on individuals from across data sources can be combined. However, it still has much work to do in terms of creating common measures and practices. Movement in this direction can also be seen in the rapid data collection during the COVID-19 outbreak. Statistics Canada generated crowdsourcing data, for the first time in its history, and worked to use questions similar to those being launched by NGOs, and in consultation with stakeholders, that could potentially be linked through geographic units to the agency’s other data.
We believe that coordination of such issues and data will form a national data spine that will allow for a more robust portrait of people, while also minimizing the cost of analysis. Doing so will take advantage of the nimbleness of non-government sources, as they will be able to link to it, allowing a fluid and dynamic data landscape. As well, if countries adopt a set of common protocols, it will allow for better international comparisons. This is already being done in the European Union but could be extended to larger networks of countries for international or global standards. Perhaps one avenue would be to leverage international organizations, such as the United Nations, or intergovernmental forums, such as the G7, or other alliances, like NATO.

Similarly, it will be worth exploring common data management practices and legislative standards for ensuring ethical use of data and adequate and responsible protection of privacy. Doing so will allow for better linkage within countries as well as across them. Here, we believe national statistics agencies can play a key role in coordinating data ecosystems. The same role can be played transnationally by organizations like EUROSTAT or the OECD. In addition to focusing on creating, collecting and managing data, their mandates will also need to focus on helping promote the harmonization of goals and practices amongst researchers, and across all sectors. To do so will mean that such agencies, and the research community that works with them, will need to work to create new relationships, trust, and means of accessing data across the data system to promote fluid flow and exchange of information. This may seem lofty; however, the creation of common, or even international, standards in other sectors has sparked innovation and collaboration leading to better products and outcomes. We cannot see why doing the same with data construction and management would not have similar positive outcomes. Developing data spines and standards will truly tap into the extraordinary amount of data collected and offer stronger insights on the unprecedented migration and immigrant settlement the world is currently experiencing.