Text data mining on current newspaper articles from the United States with ProQuest TDM Studio

This paper introduces and reviews a novel digital resource and service that offers access to a wealth of newspaper data from the United States for Text Data Mining (TDM) and Natural Language Processing (NLP). Due to copyright restrictions, gaining access to relevant text corpora of this sort can be difficult. However, ProQuest TDM Studio and similar services provide researchers with access to data and resources that were previously unavailable for TDM to this extent. By using these tools, researchers can gain insights into current newspaper discourses that still have a tremendous impact on debates in society and on political decisions. After giving an account of the structural and procedural elements that are of relevance for TDM research projects when working with data providers, the paper describes the newspaper data that ProQuest TDM Studio makes available and reviews the ways in which this data can be examined via the tools that ProQuest TDM Studio’s research environment offers. After contrasting this setup with other data providers and their systems, the paper concludes with an analysis of the opportunities and challenges of working with data providers and research environments such as those provided by ProQuest TDM Studio.

Text Data Mining (TDM) (cf. Ignatow and Mihalcea 2017; Lemke et al. 2016) and Natural Language Processing (NLP) (cf. Raina and Krishnamurthy 2022) can be very powerful tools for research in communication and media studies, particularly if they are applied to interesting and relevant text corpora. A main focus in this field is the discourse happening in newspapers because this form of mass media still has a tremendous impact on debates in society and on political decisions (cf. Newman et al. 2021; Jarren and Vogel 2009). However, getting access to current newspaper articles for TDM is often challenging due to the copyright protection of these materials. Copyright holders such as publishers and newspapers have been wary of the speed and ease with which textual data can be copied and distributed on the web. In digital formats, their contents are hard to protect, and it is difficult to restrict the data to certain forms of utilization only. Consequently, this could threaten their ability to monetize their contents, a situation which has made TDM projects on current newspaper sources rather difficult. This is especially the case if research projects aim to analyze and compare several newspapers, instead of just one or two, in order, e.g., to get an overview of the discourse taking place on a given topic in an entire country. Individual data providers like Factiva (2023) or LexisNexis (2023) as well as research institutions such as the German National Library (DNB) have made newspaper articles available in digital formats for research projects in the past. However, these offerings often came with several restrictions that hindered the effective, large-scale application of TDM approaches to these materials, thereby obstructing researchers' efforts to analyze current newspaper discourses.
In this paper, I introduce and review a novel digital resource and service that offers access to a wealth of newspaper data for TDM. The platform ProQuest TDM Studio (2023) allows the computational analysis of the ProQuest databases (2023) that an institution has a subscription to through a cloud-based work environment (cf. Megwalu and Engelsen 2022). ProQuest databases contain hundreds of current as well as historical newspapers in digital formats, and by operating in the cloud, the service circumvents many copyright issues and permits research on materials that previously have not been accessible to this extent for TDM. This resource has opened up new possibilities for a multitude of enquiries in the field of communication and media studies and promises interesting insights through concrete research projects as well as important stimuli for the field of (digital) discourse analysis as a whole (cf. Wiedemann and Lohmeier 2019; Lemke et al. 2016).
I will begin my review with a general account of the structural and procedural elements that are of relevance for TDM research projects when working with data providers in section two. This overview will help to identify key components for evaluating the features of ProQuest TDM Studio and other providers throughout the paper. In section three, I will describe the newspaper data that ProQuest TDM Studio makes available and the forms in which it does so. Furthermore, I will explain the ways this data can be examined via the tools that ProQuest TDM Studio's research environment provides. In the following discussion (section four), I will contrast this setup with other data providers in order to compare the features of these different approaches. I will conclude the paper with an analysis of the opportunities and challenges of working with data providers and research environments such as those provided by ProQuest TDM Studio.

Getting access to textual data for TDM
Working with textual data comes with several hurdles. As Megwalu and Engelsen (2022) assess, these include "identifying sources for data extraction, securing copyright permission, cleaning datasets, normalizing disparate data formats, dealing with gaps in coverage, finding appropriate software for analysis, as well as data storage and computing issues" (p. 42). At times, these challenges can even halt a project entirely because one cannot get access to the relevant materials or process them to the extent required to answer the given research question. These obstacles get particularly concerning if copyright protected materials, such as current newspaper articles or blog posts, are relevant to the project (cf. Fiil-Flynn et al. 2022). Below, I present a diagram to illustrate some of the structural and procedural elements that are important for TDM projects (Fig. 1: Important points to consider when setting up TDM projects). In most cases, it is best to recognize and address these issues upfront before conducting such a project. I will explain the issues laid out in this diagram in more detail in the following passages in order to evaluate how ProQuest TDM Studio and the systems of other data providers deal with these important matters. In the end, the ways in which providers address these problems determines how useful and practicable they can be for individual researchers and their projects in the field of communication and media studies.

Sources
Working with data providers, the first question concerns the sources they have available for TDM: Which newspapers do they provide access to? Is this selection representative of the newspaper discourse in a given country or region, or just a small fraction of the overall picture? What time frames of these newspapers are available for TDM? Do the archives go back over the last 20 years, or are just the latest five years accessible? Do these time frames of availability differ from newspaper to newspaper, or is the accessibility the same across the board?
These basic questions determine whether and to what extent a given research question can be addressed by the sources that the provider makes accessible, as they indicate the possible corpora that can be compiled based on these materials. The last question in this section ("Is the documentation about the sources available and easily accessible?") points to the availability of documentation about the sources that a provider makes accessible. One would expect it to be the norm that such information is easily accessible through the website of the provider. However, experience shows that this information is often difficult to come by: many providers do not have it readily available on their website, and even direct enquiries by email have not always led to a clear picture of what is accessible. I particularly encountered this problem when using Nexis Uni (2023), but partly also while working with Factiva (2023) and to a certain extent during the use of Constellate (2023). Extensive and easily accessible documentation regarding available sources is generally desirable.

Filtering
The second point concerns the filtering of the given materials. In other words, what mechanisms are provided to create specialized sub-corpora out of the overall data based on your research questions? Compiling these sub-corpora is crucial and determines the possible outcomes of the respective projects. Usually, providers' interfaces allow users to easily search for and select the contents of relevance to the given research question. In my case, I was searching for newspaper articles from a certain time span that mention specific keywords, and this mostly worked well. However, the accuracy and intuitiveness of these search mechanisms can vary from provider to provider. Sometimes, rather laborious verification processes were required to determine whether all of the intended texts were correctly identified and added to the corpus. This can be especially arduous when working with datasets from several distinct providers.

Download options
Options for downloading and processing the compiled text corpora differ greatly. These aspects can be used to determine in advance whether a service or provider is suitable for big data projects. Some providers allow for the download of many files at once (bulk download), whereas other systems force the user to download one file (e.g., one newspaper article) at a time, thereby limiting the scope of the project that can be done with this provider. In addition, some providers have an explicit download limit, so even if users find ways to automate the single download option, only a certain number of articles can be downloaded (e.g., 250 articles per user). Further limitations are indirectly imposed by the file formats that are available for downloading. Formats such as PDF, DOCX, or RTF usually require further preprocessing before TDM analysis can begin, whereas XML and JSON files allow for relatively smooth and direct integration into downstream workflows and are therefore often preferable for TDM projects.
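To illustrate the difference in practice, here is a minimal sketch, assuming a hypothetical JSON export whose field names are purely illustrative (not any provider's actual schema): a structured record can be consumed directly, with no layout extraction or OCR cleanup in between.

```python
import json

# A hypothetical article record as a provider might deliver it in a JSON
# export; the field names are illustrative, not any provider's actual schema.
raw = """
{
  "title": "Lawmakers Debate Platform Regulation",
  "publication": "Example Daily",
  "date": "2021-02-15",
  "fulltext": "Congress discussed possible changes to Section 230 of the CDA."
}
"""

article = json.loads(raw)

# The record maps directly onto a downstream TDM workflow: metadata and
# full text are immediately addressable as fields.
word_count = len(article["fulltext"].split())
print(article["publication"], article["date"], word_count)
```

A PDF of the same article would first require text extraction and cleanup before any of these fields could be addressed, which is why structured formats save a whole preprocessing stage.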

Processing
The processing of the data generally occurs on a local machine after downloading the data, or in cloud-computing environments hosted on the provider's servers. The advantage of the former is that the data remains with the user, on their machine, even after they have stopped working with the service provider that granted access to these materials. This way, the data can easily be integrated into mixed-methods environments such as MAXQDA (2023), which simplifies qualitative coding and close reading. Particularly for smaller datasets based on PDF, DOCX, and RTF files, this may be the optimal way to carry out one's project. For large datasets, cloud-computing environments are superior because they circumvent the copyright issues mentioned at the beginning of this article. Working with these kinds of systems also outsources the data storage as well as the actual computing to the provider. These environments therefore keep the user's machine free for other tasks, which can be useful when working with large datasets. Furthermore, these systems are the reason why copyright protected materials are becoming available for TDM on a large scale in the first place: because the textual data is not and cannot be downloaded to local machines but remains with the provider, copyright holders need to worry less about their contents being copied and shared on the web.

Results
Very important for any research project is the ability to share results with the research community and the public at large. Here the question arises which kinds of results one may export and publish, and in what ways. Generally, one can share derived data and snippets (short text examples and quotes) without issue. However, sharing larger passages or complete articles is more complex. In many cases, researchers are not allowed to share the datasets that they have compiled and downloaded with anyone else, in principle not even with their own research group (see footnote 10). Researchers doing TDM clearly enter a gray area at this point, and it becomes even more challenging if we consider sharing whole datasets for peer-review processes. It is considered good practice in the fields of the computational social sciences and the digital humanities to share the datasets that an article or research project is based on (cf. Peter et al. 2020). However, for copyright protected materials this comes with many legal uncertainties. In this regard, it is important to closely examine individual data providers' guidelines regarding the sharing of datasets and the restrictions that they put on such processes.

Continuous access (sustainability)
The last point concerns access to the data after the (provisional) completion of the research. How and for how long will the data used in a project be available? This question becomes relevant if researchers want to return to the data to continue or expand their analysis. If the data was initially stored on their local machines, the issue of continued access is avoided, but one must plan for storage and accessibility on one's own machines. With regard to results such as derived data and snippets, accessibility is typically unhindered, but the matter becomes more complex if we consider complete datasets or even several of them. Storing these datasets safely for a longer period of time and keeping documentation about them up to date can be time-consuming and may be neglected, which is often regretted later. The situation is different if the data rests on the provider's servers. In this case, there might very well be options available to store this data, but most likely this requires a continuous subscription to the service and the associated payment of fees. Depending on the individual project and the given funding situation, paying such costs may not be feasible after the (tentative) end of the project. Therefore, mid- and long-term access to results and datasets can pose a significant issue under these circumstances as well.

Data availability via ProQuest TDM Studio
Given the outline of questions raised in the diagram above and its explanation in section two, I will now evaluate ProQuest TDM Studio (2023; Megwalu and Engelsen 2022) as a resource and service for doing TDM analysis on copyright protected documents, particularly on current newspaper articles. In the project that introduced me to different data providers, I was interested in the discourse happening in US newspapers regarding questions at the intersection of society and technology, especially concerning the regulation of social media platforms and the impact of such legislation on hate speech and freedom of expression. More specifically, I was looking for newspaper articles on Section 230 of the Communications Decency Act (cf. Communications Decency Act 1996), a law that has been crucial for the development of the Internet as we know it (cf. Kosseff 2019). This focus meant that I was searching for current newspaper articles (instead of historical ones). The timeframe was determined by Section 230 having been enacted in 1996 and by the fact that a discussion about a potential reform of this piece of legislation arose around and after the January 6 United States Capitol attack in 2021 (cf. Raskin 2022). I aimed to analyze the national discussion about this law and was therefore trying to get access to the most important, nationwide daily newspapers from the United States and their coverage of the topic. This focus has obviously influenced my use of data providers and their resources and, accordingly, impacted this report.
To pursue my research on current newspaper articles from the United States, I turned to ProQuest TDM Studio. Access to this service was provided by the libraries at the Massachusetts Institute of Technology (MIT) through a cooperation with the Comparative Media Studies/Writing program (CMS/W).¹ As described above in Sect. 2.1, the first and to an extent most important point for a TDM project of this kind is the issue of the available sources. ProQuest TDM Studio generally provides access to different publication types such as scholarly journals, trade journals, newspapers, reports, magazines, blogs and websites, conference papers and proceedings, and many more. However, the data that can be worked with on the platform and in the computational TDM environment depends on the ProQuest databases that the respective institution has a subscription to. In this sense, the data for my project was based on the MIT libraries' access to ProQuest databases, and the volume and kind of data that other institutions have access to via ProQuest TDM Studio may therefore be different (see Megwalu and Engelsen 2022, p. 44). However, the newspaper data described here is available through ProQuest, and cooperating with institutions that already have access to these kinds of databases is a promising option for research collaborations, as I will explain later on.
For my project, the 2618 newspaper sources available to MIT's ProQuest TDM Studio subscription were of relevance.² I was particularly interested in newspaper articles from the United States, which led me to focus on the more than 900 current newspaper sources available from this country. Among the available sources are regional as well as nationwide US newspapers, ranging from the Albuquerque Journal to the Washington Post. Not all of these sources are available as full texts, and not all of them on an ongoing basis, meaning that new issues are continually added to the database. Nevertheless, many newspapers, especially the important ones, are available on an ongoing basis. The date from which full texts are available varies from source to source: e.g., the Wall Street Journal is available from January 2, 1984 to the present, while the Washington Post is available from December 4, 1996 onwards.³ Generally speaking, many of the essential nationwide US newspapers are accessible in their full-text versions for TDM from the end of the last century onwards. This makes comparative analyses of the topics and discourses in these papers possible for the last 20, if not for almost the last 40 years, and constitutes a very relevant resource for research in communication and media studies in general and US journalism studies in particular. Furthermore, in many cases ProQuest TDM Studio provides the full text of a given paper's printed version separately from the full text of its online version. This feature allows researchers to differentiate between the printed and online versions and to easily compare the coverage between the two. Given the increase in importance and reach of newspapers' online versions and the corresponding decline in sales of printed copies, this option adds interesting opportunities for new and stimulating research inquiries.
Taking the example of my own research, ProQuest TDM Studio allowed me to analyze the full texts of the Boston Globe, the Chicago Tribune, the Daily Herald, the Los Angeles Times, the New York Times, USA Today, the Wall Street Journal, and the Washington Post in the time frame relevant to my analysis, thereby providing a very good overview of the nationwide daily newspaper coverage in the United States. I also analyzed the full texts of the online versions of the New York Times, the Wall Street Journal, the Washington Post, and USA Today. A slight imbalance towards left-leaning sources became apparent with respect to the availability of US daily newspapers, given the fact that five of the above-mentioned eight newspapers in my corpus are generally categorized as "left-center," "liberal," or having a "left-lean" (Boston Globe, Washington Post, USA Today, NYT, LA Times), while only three of them are considered "right-center" or "center" (Chicago Tribune, Daily Herald, WSJ).⁴ However, the opportunity to also work with data from blogs and websites, e.g., from publications specializing in tech topics such as Wired, TechCrunch, or Engadget, enables researchers to broaden the political spectrum of analysis according to the given research question and interest. ProQuest TDM Studio's "Blogs, Podcasts, & Websites" category contains more than 700 sources from the United States,⁵ covering a myriad of topics and representing differing levels of expertise and professionalism, ranging from amateur blogs and podcasts to established online news portals. More "independent" or "center-right" leaning sources may be added to the mix based on this feature.
When it comes to current German newspapers, the data availability is presently not as broad, at least based on the MIT libraries' subscriptions. ProQuest TDM Studio only provides access to seven distinct sources (Berliner Morgenpost, die tageszeitung [taz], Die Welt, Die Zeit, Frankfurter Allgemeine Zeitung [FAZ], Hamburger Abendblatt, Munich Eye), and with varying degrees of access to the full texts. Die Welt and taz are accessible from 2008 and 2009 onwards, but the FAZ only from 2017 on. Therefore, an overview of the German newspaper landscape, as intended in my project, would not be possible based on the data provided by ProQuest TDM Studio.⁶ ProQuest TDM Studio also provides access to historical newspapers and sources. However, based on the subscription of the MIT libraries, there were no historical newspapers available from Germany. From the United States, about 50 different sources are obtainable, as well as a few sources from India and China published in English, but here the time frames of availability differ greatly. Additionally, these resources are not available as full texts for TDM, as current newspaper data is, but come as PDFs for inspection and downloading and therefore potentially require further pre-processing, as discussed above in Sect. 2.3.
Overall, the data availability for the United States regarding current newspaper articles is extensive and allows for the analysis of much of the discourse in daily nationwide newspapers based on the computation of the full texts of these papers' articles. This generates very promising opportunities for the research of discourses happening in US newspapers based on TDM and NLP, especially since such a wealth of data was not digitally available before, particularly not combined in one unified analysis environment. On the basis of these large specialized text corpora and the accompanying meta data, it becomes possible to analyze and structure the discourse in this particular subfield of the public sphere by drawing on research methods such as frequency, co-occurrence and network analyses, topic modeling, named entity recognition, text clustering and outlier detection, as well as through the automated analysis of more complex argumentations based on machine learning algorithms (cf. van Atteveldt et al. 2021).⁷
These quantitative methods certainly have their limitations and may privilege the rather "superficial" inspection of text and discourse from a bird's-eye view (see Grimmer and Stewart 2013), but precisely because of that they allow for the structuring of large amounts of text and for the identification of relevant sections for close reading and detailed analysis. Such a "blended reading" approach (Stulpe and Lemke 2016), i.e., the combination of quantitative and qualitative analysis, uses the above-mentioned procedures of "distant reading" (Moretti 2013) to detect structures and anomalies (cf. Moretti 2017), which can then be investigated by "close reading" (Smith 2016) and qualitative analysis, and allows for the application of information from outside the text for further interpretation based on the initial quantitative findings. My own research follows such a pathway and aims to pinpoint the most influential stakeholders within the discourse about Section 230 CDA in order to examine their communication strategies and the arguments that they bring forward. It asks what challenges these actors detect with regards to the impact of digital technologies and their regulation on democracy, and how they frame these issues. ProQuest TDM Studio and NLP methods alone cannot deliver answers to all of these questions, but in many regards they are a good starting point for my inquiries.
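As a minimal sketch of the first two of the methods just mentioned, frequency and co-occurrence analysis can be implemented with the Python standard library alone; the three toy documents below stand in for the newspaper full texts one would actually query:

```python
from collections import Counter
from itertools import combinations

# Toy corpus standing in for downloaded newspaper full texts; in practice
# the documents would come from a compiled TDM dataset.
docs = [
    "section 230 shields platforms from liability",
    "critics say section 230 enables hate speech",
    "platforms moderate speech under section 230",
]

# Term frequencies across the whole corpus.
freq = Counter(tok for doc in docs for tok in doc.split())

# Document-level co-occurrence counts of term pairs (each unordered pair
# counted once per document in which both terms appear).
cooc = Counter()
for doc in docs:
    for a, b in combinations(sorted(set(doc.split())), 2):
        cooc[(a, b)] += 1

print(freq.most_common(3))
print(cooc[("230", "section")])
```

Real projects would add tokenization, stop-word removal, and lemmatization, but the counting logic at the core of frequency and co-occurrence analyses is exactly this simple.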
In the following section, I will examine the research environment that the platform offers for the analysis of the data more closely.

The ProQuest TDM Studio research environment
The ProQuest TDM Studio research environment provides a "Workbench" section that enables users to run code in Jupyter Notebooks (Jupyter 2023), as well as a "Visualization" section which allows for the generation of visualizations of the chosen data based on geographic analysis, topic modeling, and sentiment analysis. While these pre-made visualization tools can be helpful for obtaining an initial overview of the data and for determining further steps for analysis, they come with several restrictions: users are only allowed to create up to five different projects, may select only 10 different publications (e.g., newspapers) for their analysis, and cannot download images from the "Visualization" dashboard, but have to take screenshots of them for their use (see Megwalu and Engelsen 2022, p. 45). In their review, Megwalu and Engelsen point out additional problems with these features based on the nature of algorithmic text mining, which can, for example, lead to the misidentification of common place names and therefore to skewed results in the geographical analysis option (2022, p. 45). However, these pre-made visualization tools might be effectively used in teaching to introduce students to working with large datasets, as Megwalu and Engelsen suggest (2022, p. 45).
In contrast to the "Visualization" section, the "Workbench" section allows researchers to dig deeper and more selectively into the data by using pre-existing scripts supplied in the environment or the user's own code in Python or R. The aforementioned restrictions regarding the number of projects and the selection of publications for the pre-made visualization tools do not apply here, so the data can be thoroughly investigated. Diagrams can either be generated directly in the Jupyter Notebook environment or, after downloading derived and meta data, with other tools. This downloading of derived and meta data also allows for the further processing of the data with other analysis tools such as Gephi (2023) for network analysis, or basically any other relevant instrument that can handle and interpret this data in a useful manner depending on the individual researcher's interests and research question. Only the full texts cannot be downloaded and then processed with other tools, but must (largely) remain within the environment.
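For instance, derived data can be written out as an edge list that Gephi's spreadsheet importer reads as an edge table ("Source", "Target", and "Weight" are column names its importer recognizes); the actor pairs and weights below are purely illustrative:

```python
import csv

# Hypothetical derived data: weighted co-occurrence edges between actors
# identified in the articles (names and weights are made up for illustration).
edges = [
    ("Congress", "Section 230", 42),
    ("FCC", "Section 230", 17),
    ("Congress", "FCC", 9),
]

# Write an edge list CSV that can be imported into Gephi as an edge table.
with open("edges.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["Source", "Target", "Weight"])
    writer.writerows(edges)
```

Because only derived data like this leaves the environment, the workflow stays within the copyright constraints while still feeding external network-analysis tools.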
The starting point for each project is the compilation of the data that the user wants to work with. On the landing page of the "Workbench" dashboard, users can create a new dataset based on their choice of publications, e.g., different newspapers or contents from other ProQuest databases (Fig. 2). Note that there is a limit of ten datasets that can be created and stored in the workbench environment at any given time, but users can generate more by deleting datasets already uploaded to, and thereby stored in, the Jupyter notebook environment to make space for new datasets in the workbench. There is no restriction on storing datasets in the notebook environment.
At the outset, users can pick either one or several sources and then apply different filters to the sources and their contents. Within the chosen sources, e.g., all available articles by the Boston Globe, you can then further refine the sample in the sidebar of your screen by limiting the date range, selecting one or more specific document types (e.g., newspaper), and selecting one or more specific source types (in the case of newspapers, e.g., "News," "Commentary," "Review," "Editorial," etc.). For my project, I selected individual US newspapers and created datasets based on all articles that featured given keywords in the time frame of relevance. The search bar provides a wide range of functionality to filter one's datasets by the text within the documents (Fig. 3). The operators include AND, OR, NOT, NEAR/N, EXACT, and LINK.⁸ As already mentioned above, the option to differentiate between the printed and the online versions of certain newspapers enables the creation of two distinct datasets for, e.g., the Boston Globe based on the given criteria, namely all articles that appeared in print featuring the relevant keywords in the first dataset and all articles that were published in its online version in the second, and the saving of both of these datasets separately to the workbench for computation. One can also generate datasets based on several newspapers or a mix of sources (blogs, reports, academic journal articles, magazine articles, etc.), as long as the resulting dataset does not exceed the limit of two million documents per dataset.
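To keep complex filters reproducible across datasets, such queries can also be composed programmatically. The sketch below only uses the operator names listed above; the exact grouping syntax accepted by ProQuest's parser is an assumption on my part and should be checked against its search documentation:

```python
def build_query(any_of, near=None, not_terms=None):
    """Compose a keyword query from OR, AND, NEAR/N, and NOT operators.

    `near` is a hypothetical (term_a, term_b, distance) triple; the
    parenthesized grouping used here is illustrative, not ProQuest's
    verified grammar.
    """
    query = "(" + " OR ".join(any_of) + ")"
    if near:
        a, b, n = near
        query += f" AND ({a} NEAR/{n} {b})"
    for term in not_terms or []:
        query += f" NOT {term}"
    return query

q = build_query(
    any_of=['"Section 230"', '"Communications Decency Act"'],
    near=("regulation", "platforms", 5),
    not_terms=["sports"],
)
print(q)
```

Keeping queries as code rather than retyping them into the search bar also documents, for later verification, exactly which filter produced which dataset.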
As soon as the dataset is processed by the system, it can be accessed within a Jupyter Notebook and is ready for analysis with Python or R. It is important to note that all newspaper data comes in the same unified data structure, enabling the effortless comparative analysis of articles by different newspapers. The articles are accessible as XML files, and relevant meta data such as the title, subtitle, publication date, authors, wordcount, newspaper specific tags, and more can be extracted and then used for analysis. Results of the analysis, such as derived data or visualizations generated in the notebooks, can be exported and downloaded to the user's local machine. However, one has to keep in mind that there is a 15 MB download limit per week, which might constrain certain projects. The actual texts of the articles or sources compiled in these datasets can be examined with more sophisticated NLP methods such as topic modelling, named entity recognition, sentiment analysis, or the application of machine learning algorithms to identify more complicated arguments, which requires more in-depth coding skills. ProQuest TDM Studio provides a few scripts for these types of procedures for non-coders, and the support team is ready to help with concrete questions regarding these or new scripts to be developed, but working with your own individually written and customized code is surely ideal. It gives the user more oversight over the computational procedures, particularly as the existing scripts come with scarce documentation (see Megwalu and Engelsen 2022, p. 44).
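As a sketch of this metadata extraction step, the snippet below parses one article record with the Python standard library; the element names are placeholders for illustration, since the actual schema should be inspected in the workbench itself:

```python
import xml.etree.ElementTree as ET

# An illustrative article record; the element names are placeholders, not
# ProQuest's actual XML schema.
xml_doc = """
<article>
  <title>Platforms Face New Scrutiny</title>
  <pubdate>2021-01-20</pubdate>
  <publication>Example Daily</publication>
  <wordcount>812</wordcount>
</article>
"""

root = ET.fromstring(xml_doc)

# Flatten the child elements into a record suitable for tabular analysis,
# e.g. one row per article in a metadata table.
record = {child.tag: child.text for child in root}
print(record["publication"], record["pubdate"], record["wordcount"])
```

Because every newspaper in the environment shares one data structure, a loop over such records is all it takes to build a comparative metadata table across titles.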
The fact that the full texts on which these computations are performed remain on ProQuest's servers and are not downloaded to the local machine of the user enables this copyright protected textual data to be made available in the first place, as mentioned above. This setup essentially limits the risk for newspapers and publishers of losing out on potential profits due to their contents becoming freely available online. At the same time, it facilitates the examination of large amounts of textual data with TDM techniques without burdening users' own computational resources such as storage space and computing power. However, ProQuest TDM Studio's very beneficial setup also comes with drawbacks when it comes to options for the inspection of the texts by close reading. The system allows users to download individual XML files, but each of them has to be requested separately. This process is cumbersome and automatically prevents users from downloading large numbers of articles. It is clear that the system is neither engineered nor intended for the mass download of the materials. There is also the already mentioned 15 MB download limit per week, which includes derived data and visualizations as well. All this means that individual checks of the textual data by close reading are very much possible, but reading larger amounts of documents may turn into a challenge. Furthermore, there is no option to analyze the data in a mixed-methods environment comparable to MAXQDA (2023) or ATLAS.ti (2023). As described above, one can neither download larger amounts of textual documents and integrate them into such systems on one's local machine, nor does ProQuest TDM Studio provide its own software solution for a more comfortable reading of the texts or an easily manageable tagging and analysis system. Adding functions of this kind to the existing structure would make ProQuest TDM Studio even more attractive for researchers in the social sciences.
As discussed in section two, the sharing of results and data is crucial for research projects. Particularly in the fields of the computational social sciences and the digital humanities, there is a growing understanding that the datasets constituting the basis of research projects should be shared and made available for peer review and inspection by the research community (cf. Peter et al. 2020; Journal of Open Humanities Data 2023). However, it is often very difficult to grant peers or the public at large access to the data researchers are working on, due to copyright protection and license agreements that data providers such as libraries and research institutions have entered into with the copyright holders. In this regard, it is particularly worth mentioning that ProQuest TDM Studio provides the opportunity to give peer reviewers access to the workbench and data a researcher or research team has been working with, in order to allow for the inspection of the calculations and the underlying data.9 This feature seems to be a very convenient solution and a promising step forward towards more transparency and verifiability in TDM research, and it may become an ever more important asset in the future. It is surely not as ideal as sharing complete datasets in an open data manner, but given the current status of copyright protection and the necessity to finance quality journalism under the circumstances of the digital age, this setup nonetheless seems to be an improvement. Figure 4 synthesizes the gathered insights and depicts the features of ProQuest TDM Studio in relation to the structural and procedural elements that were identified as relevant to TDM research projects at the beginning of this article (Fig. 4).

Fig. 4 Features of ProQuest TDM Studio

Discussion: advantages and disadvantages of ProQuest TDM Studio
The advantages of ProQuest TDM Studio are evident: first and foremost, the service provides access to a wealth of copyright-protected newspaper data from the US for TDM that has not been accessible in this way before. The ability to analyze several major US daily newspapers as well as many regional papers and additional sources like blogs, conference proceedings, academic journals, magazines, etc., all in one environment, has great potential for exploring research questions in communication and media studies in general and journalism studies in particular. Filtering sources to compile datasets and specific text corpora for examination is easy, intuitive, and fast. Furthermore, the Jupyter Notebook environment described above enables the convenient application of powerful analysis tools, based on NLP methods, to the full texts of these materials. Downloading results such as derived data, metadata, and visualizations, as well as working on the data with a team of collaborators potentially based at different locations, is supported by the system and facilitates joint research projects very well. The underlying common data structure of the available current newspaper sources, based on XML files, releases researchers from the burden of further preprocessing the data before the analysis can begin and allows for the comparative analysis of different newspaper sources. Moreover, the cloud-based solution for computing the data not only circumvents many copyright issues, but also keeps users' local machines free of the storage and the actual processing of the data. Furthermore, the options provided by ProQuest TDM Studio to share the code used during a research project, as well as the underlying data, with peer reviewers seem to be a good step forward towards more transparency and verifiability in TDM work on copyright-protected materials. In this sense, the service is well set up to facilitate groundbreaking quantitative research in the abovementioned fields.
These advantages of ProQuest TDM Studio particularly come to the fore when compared to other, already existing services and data providers. Working with, for example, Nexis Uni (2023), I encountered many restrictions that ProQuest TDM Studio circumvents. Firstly, Nexis Uni users only have the option to download newspaper articles as PDFs, DOCs, or RTFs, which necessitates further preprocessing before one can properly analyze the files with TDM techniques. Furthermore, on Nexis Uni there is a download limit of 100 full-text articles, and the individual user is not allowed to share the data or publish results. These and other stipulations hinder the research process and may preclude projects, depending on the respective research questions and the tools and procedures necessary to answer them.10 ProQuest TDM Studio, in contrast, makes many of these examinations possible, or even easy. On the other hand, one should keep in mind that Nexis Uni's setup has some advantages over ProQuest TDM Studio, notably the option to download the data to the user's local machine and use mixed-methods software for its analysis there, something that ProQuest TDM Studio does not provide. It also needs to be mentioned that other providers such as LexisNexis (Nexis Data Lab), JSTOR (Constellate), or Gale (Gale Digital Scholar Lab) offer new services similar to those of ProQuest TDM Studio. Some US university libraries have set up their own similar infrastructures as well. The differentiating factors between these offerings are the databases that they can draw on and therefore the content that they can provide to their users for TDM (see Megwalu and Engelsen 2022, p. 45).
In a different research project, on the current internet policy discourse in German newspapers, I collaborated with the German National Library (DNB) to access the most important daily newspapers for TDM. The holdings of the DNB are extensive, and I was able to gather the textual data of these papers regarding our specific research question through the DNB. However, due to legal restrictions, we had to work on these materials on the premises of the DNB, cut off from the Internet, which considerably complicated the computational analysis (Pohlmann et al. 2023). Furthermore, the output format of the articles we searched for was PDF files that represented the complete newspaper page on which our keywords were found. This meant that the relevant articles had to be separated from the rest of the text on the respective pages in a laborious and error-prone process. In this respect, it becomes obvious that a cloud-based setup like the one from ProQuest TDM Studio would be very beneficial for accessing and computing copyright-protected materials from Germany for research purposes, especially since ProQuest TDM Studio only provides very limited access to German newspapers, as outlined above.

10 I received this list of guidelines regarding data use and TDM via an email correspondence with the Nexis Uni customer service: "Nexis Uni Academic services are only for personal academic use. Downloading of large volumes of data for use in text mining applications or with other automated trend analysis software is not permitted. There is a download limitation of 100 full text articles and 250 via results list. Downloading text to create a corpus of text for analysis is not permitted. Storage of Nexis Uni data in a shared archive is not permitted. Access to Nexis Uni requires manual use of the Nexis Uni interface. Using software or other automated tools to systematically download licensed content is not permitted".
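To give a sense of the separation problem described above, the following is a minimal, hypothetical sketch of one way to filter keyword-relevant passages out of the plain text of a full newspaper page. It is not the procedure actually used in the project; the paragraph-splitting rule (blank lines) and the simple keyword match are simplifying assumptions, and real page layouts are far messier.

```python
# Hypothetical sketch: isolating keyword-relevant passages from the plain text
# of a full newspaper page (e.g. text previously extracted from a page PDF).
# Splitting on blank lines and matching bare keywords are illustrative
# simplifications; real extraction output is rarely this clean.
import re

def relevant_paragraphs(page_text, keywords):
    """Return the paragraphs that contain at least one keyword (case-insensitive)."""
    # Build one alternation pattern from all keywords, escaping special characters
    pattern = re.compile("|".join(re.escape(k) for k in keywords), re.IGNORECASE)
    # Treat blank lines as paragraph boundaries
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", page_text) if p.strip()]
    return [p for p in paragraphs if pattern.search(p)]

page = (
    "Unrelated story about local sports results.\n\n"
    "New rules for Internet policy were debated in parliament yesterday.\n\n"
    "Weather report for the coming week."
)
hits = relevant_paragraphs(page, ["internet policy"])
```

Even a heuristic this simple shows why the process is error-prone: an article that continues in a different column, or a keyword split across a line break, would be missed, which is part of what made the manual separation so laborious.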
Challenges concerning the use of ProQuest TDM Studio can arise from the proprietary nature of the service. This may particularly come into play after the actual research has been finished, as explained in section two. Continuous access to the datasets, the analyses done, and the code run during the project depends on an uninterrupted subscription to ProQuest's service and may therefore lead to a dependency on the platform, at least if users want to return to the research later on to use and edit it. This could be especially troubling if a researcher moves to a new institution that does not subscribe to ProQuest TDM Studio, or not to the particular ProQuest databases that the initial research was based on. Especially for younger scholars who have not yet found a permanent position and switch employers frequently, this could turn into a major problem. Additionally, mid- and long-term storage of and accessibility to the data could become difficult if the initially quite reasonable prices are raised at some point in the future.11 From the viewpoint of sustainability, these matters are important to keep in mind. It is also obvious that this potential lock-in effect runs counter to the open source and open access efforts championed by the research community in general (cf. De Silva and Vance 2017; Schimmer et al. 2015) and European funding agencies in particular (cf. European Commission 2019; Deutsche Forschungsgemeinschaft 2023).
Given that access to ProQuest TDM Studio and to the data and resources that this service provides facilitates research on materials that have not been available for TDM in this way before, it is likely that research projects drawing on this service (or similar products) can produce new insights that are relevant to the respective fields and will yield publications. From a researcher's perspective, it seems apparent that one would like to use such a platform to explore questions regarding the discourse happening in US newspapers as well as in other sources such as blogs/websites and academic journals, at least if one is interested in working with quantitative approaches and NLP methods. Therefore, researchers will likely push for subscriptions to such services or flock to institutions that have the means to pay for research infrastructures of this kind. In the end, this may lead to a greater divide between universities and institutions that can afford such services and those that cannot, which may be considered a major problem. However, it also needs to be noted that "ProQuest recognizes the practice of research collaborations across institutions," as Megwalu and Engelsen put it (2022, p. 46). This means that researchers from different institutions can become part of a research group at an institution that subscribes to ProQuest TDM Studio's services and thereby get added to the group's workbench, as happened in the project described in this paper. This is an option that should be explored further and made use of, particularly in transatlantic research collaborations.
From the perspective of German and European communication and media studies, one would wish to have a similar service available for German and/or European newspapers in order to analyze them as outlined above. In particular, the opportunity to examine, in one working environment and with TDM techniques, a sample of different newspapers that largely represents a country's complete newspaper landscape is very advantageous and promises to generate new and important stimuli for discourse analysis. One way to make this possible would be for existing services such as ProQuest TDM Studio to expand their offerings by including more German and European materials. One could also imagine a German/European equivalent to ProQuest TDM Studio arising, meaning a for-profit platform providing such services. Alternatively, a state-funded portal, drawing on the already existing infrastructures at German libraries and research institutions as well as on the attempts to develop a National Research Data Infrastructure in Germany (cf. NFDI 2023), could deliver such materials and services. However, especially in the latter case, the protective behavior of newspapers and publishers regarding their data might make such an undertaking difficult, even if the materials are only used for research purposes, as the past has shown. With regard to the former option, the question is whether such a platform would be economically feasible and what the business model could look like. Nevertheless, from a researcher's perspective it would clearly be desirable to have access to such a service.