Treebanking user-generated content: a UD-based overview of guidelines, corpora and unified recommendations

This article presents a discussion of the main linguistic phenomena that cause difficulties in the analysis of user-generated texts found on the web and in social media, and proposes a set of annotation guidelines for their treatment within the Universal Dependencies (UD) framework of syntactic analysis. Given, on the one hand, the increasing number of treebanks featuring user-generated content and, on the other, its somewhat inconsistent treatment across these resources, the aim of this article is twofold: (1) to provide a condensed, though comprehensive, overview of such treebanks—based on available literature—along with their main features and a comparative analysis of their annotation criteria, and (2) to propose a set of tentative UD-based annotation guidelines, to promote consistent treatment of the particular phenomena found in these types of texts. The overarching goal of this article is to provide a common framework for researchers interested in developing similar resources in UD, thus promoting cross-linguistic consistency, a principle that has always been central to the spirit of UD.


Introduction
The immense popularity gained by social media in the last decade has made it an attractive source of data for a large number of research fields and applications, especially for sentiment analysis and opinion mining (Balahur 2013; Severyn et al. 2016). In order to successfully process the data available from such sources, linguistic analysis is often helpful (Vilares et al. 2017; Mataoui et al. 2018), which in turn prompts the use of NLP tools to that end. Despite the ever-increasing number of contributions, especially on Part-of-Speech tagging (Gimpel et al. 2011; Owoputi et al. 2013; Lynn et al. 2015; Bosco et al. 2016; Çetinoglu and Çöltekin 2016; Proisl 2018; Rehbein et al. 2018; Behzad and Zeldes 2020) and parsing (Foster 2010; Petrov and McDonald 2012; Kong et al. 2014; Liu et al. 2018; Sanguinetti et al. 2018), automatic processing of user-generated content (UGC) still represents a challenging task, as is shown by some tracks of the workshop series on noisy user-generated text (W-NUT).1

UGC is a continuum of text sub-domains that vary considerably according to the specific conventions and limitations posed by the medium used (blog, discussion forum, online chat, microblog, etc.), the degree of "canonicalness" with respect to a more standard language, as well as the linguistic devices2 adopted to convey a message. Overall, however, there are some well-recognized phenomena that characterize UGC as a whole (Foster 2010; Seddah et al. 2012; Eisenstein 2013), and that continue to make its treatment a difficult task.
As the availability of ad hoc training resources remains an essential factor for the analysis of these texts, the last decade has seen numerous resources of this type being developed. A good proportion of those resources that contain syntactic analyses have been annotated according to the Universal Dependencies (UD) scheme (Nivre et al. 2016), a dependency-based annotation scheme that has become a popular standard for treebank annotation because of its adaptability to different domains and genres. At the time of writing (UD version 2.6), as many as 92 languages are represented within this vast project, with 163 treebanks covering extremely varied genres, ranging from news to fiction, medical, legal and religious texts. This linguistic and textual variety demonstrates the generality and adaptability of the annotation scheme.
On the one hand, this flexibility opens up the possibility of also adopting the UD scheme for a broad range of user-generated text types: a framework that has proven readily adaptable is more likely to fit the needs of diverse UGC data sources, and the wealth of existing materials makes it easier to find precedents for analysis whenever difficult or uncommon constructions are encountered. On the other hand, the current UD guidelines do not fully account for some of the specifics of UGC domains, thus leaving it to the discretion of individual annotators (or teams of annotators) to interpret the guidelines and identify the most appropriate representation of these phenomena. This article therefore draws attention to the annotation issues of UGC, while attempting to find a cross-linguistically consistent representation within a single coherent framework. It is also worth pointing out that inconsistencies may be found even among multiple resources in the same language (see e.g. Aufrant et al. 2017; Björkelund et al. 2017).3 Therefore, even at the level of standardizing a common solution for UGC and other treebanks in one language, common guidance that takes UGC phenomena into account is likely to be useful. This article first provides an overview of the existing resources - treebanks in particular - of user-generated texts from the web, with a focus on comparing their varying annotation choices with respect to certain phenomena typical of this domain. Next, we present a systematic analysis of some of these phenomena within the UD framework, surveying previous solutions, and propose, where possible, guidelines aimed at overcoming the inconsistencies found among the existing resources.
Given the nature of the phenomena covered and the fact that the existing relevant resources only cover a handful of languages, we are aware that the debate on their annotation is still wide open; this article therefore has no prescriptive intent. That said, the proposals in this article represent the consensus of a fairly large group of UD contributors working on diverse languages and media, with the goal of building a critical mass of resources that are annotated in a consistent way. As such, it can be used as a reference when considering alternative solutions, and it is hoped that the survey of treatments of similar phenomena across resources will help future projects make choices that are as comparable as possible to common practices in the existing datasets.
Linguistics of UGC

Describing all challenges brought about by UGC for all languages is beyond the scope of this work. Nevertheless, following Foster (2010), Seddah et al. (2012) and Eisenstein (2013), we can characterize UGC's idiosyncrasies along a few major dimensions defined by the intentionality or communicative needs that motivate linguistic variation. It should be stressed that one and the same utterance, and indeed often a single word, can instantiate multiple categories from the selection below, and that their occurrence can be either intentional or unintentional.4 Figure 1 displays the hierarchy we followed to describe UGC phenomena. "Canonicalness" refers to whether a phenomenon is also observable in standard text. "Intentionality" refers to whether its production was deliberate.5 "Type" refers to the variety of the phenomenon, while "Subtype" provides a sub-categorization of each type.
- Encoding simplification: This axis covers ergographic phenomena, i.e. phenomena aiming to reduce the effort of writing, such as diacritic or vowel omissions (EN people → ppl).
- Boundary shifting: Some phenomena affect the number of tokens, compared to standard orthography, either by replacing several standard language tokens with only one, which we will refer to as contraction (FR n'importe → nimp 'whatever, rubbish'), or conversely by splitting one standard language token into several tokens, which we will refer to as over-splitting (FR c'était → c t 'it was'). In some cases, the resulting non-standard tokens might even be homographs of existing words, creating more ambiguities if not properly analyzed. Such phenomena are frequent in the corpora of UGC surveyed below, and they require specific annotation guidelines.
- Marks of expressiveness: Orthographic variation is often used as a mark of expressiveness, e.g. graphical stretching (yes → yesssss) and replication of punctuation marks (? → ?????), as well as emoticons, which can also take the place of standard language words, e.g. a noun or verb (FR Je t'aime → Je t'<3 'I love you', with the heart emoticon representing the verb 'love'). These phenomena often emulate sentiment expressed through prosody, facial expression and gesture in direct interaction; however, the written nature of UGC data means that they need to be assigned analyses in terms of tokens, parts of speech and dependency functions. Many of the symbols involved also contain punctuation, which can lead to spurious tokenization and problems in lemmatization (see below).
- Foreign language influence: UGC is often produced in highly multilingual settings, and we often find evidence of the influence of foreign language(s) on users' text productions, especially in code-switching scenarios, in domain-specific conversations (video game chat logs) or in the productions of L2 speakers, all of which complicate the typically monolingual context for which syntactic annotation guidelines are developed. In some cases, foreign words are imported as-is from a donor language (e.g. IT non fare la bad girl 'don't be a bad girl' instead of non fare la cattiva ragazza). In other cases, foreign influence can create novel words: a good example is an Irish term coined by one user to mean 'awkward', áicbheaird, whose pronunciation mimics the English word (instead of the equivalent standard Irish term amscaí).
- Medium-dependent phenomena: Some deviations from standard language are direct results of the electronic medium, including client-side automatic error correction, masking or replacement of taboo words by the server, artifacts of the keyboard or other user input devices, and more. In some cases, and especially for languages other than English, some apparent English words in UGC represent automatic 'corrections' of non-English inputs, such as Irish coicíse 'fortnight' → concise. These cases raise questions relating to the degree of interpretation, such as reconstructing likely UGC inputs before error correction, which may need to be annotated either as typos (in UD, the annotation Typo=Yes) or at an even greater level of detail in lemmatization.
- Context dependency: Given the conversational nature of most social media, UGC data often exhibits high context-dependence (much like dialogue-based interaction). Speaker turns in UGC are often marked by the thread structure in a web interface or app, and information from across a thread may provide a rich context for varying levels of ellipsis and anaphora that are much less frequent or complex in standard written language. In addition, multimedia content, pictures or game events can serve as a basis for discussion and are used as external context points, acting, so to speak, as non-linguistic antecedents or targets for deixis and argument structure. This can make the annotation task more difficult and prone to interpretation errors, especially if the actual thread context is not available, and mandates some conventional guidelines.

Table 1 presents some cross-language examples of several of the phenomena outlined above. The elements in boldface in Figure 1 are exemplified in Table 1.
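Emoticons standing in for words, as in the FR example Je t'<3 above, must nonetheless receive a regular tokenization, part of speech and dependency function. The following is a minimal CoNLL-U sketch of one possible treatment; tagging the heart as a VERB and normalizing its lemma to aimer are illustrative assumptions here, not established guidelines:

```
# text = Je t'<3
1	Je	je	PRON	_	_	3	nsubj	_	_
2	t'	te	PRON	_	_	3	obj	_	SpaceAfter=No
3	<3	aimer	VERB	_	_	0	root	_	_
```

Other resources instead keep the literal form <3 as the lemma; the choice between the two options is one of the inconsistencies discussed later in this article.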

UGC Treebanks: An Overview
In order to provide an account of the resources described in the literature, we carried out a semi-systematic search on Google Scholar using the following sets of keywords: (treebank web social media) and (universal dependencies web social media), limiting results to the first five pages, sorted by relevance, and without time filters.
We selected only open-access papers describing either a novel resource or an already existing one that had been expanded or altered in such a way that it gained the status of a new one. In the few cases of multiple publications referring to the same resource, we chose the most recent one, assuming it contained the most up-to-date information regarding the status of the resource. We also included in our collection five papers that we were aware of, but which were not retrieved by the search. As the main focus of this work is on the syntactic annotation of web content and user-generated texts, we discarded all papers that presented system descriptions, parsing experiments or POS-tagged resources (without syntactic annotation). The results of our search are summarized in Table 2.
Based on the selection criteria mentioned above, we found 21 papers describing a resource featuring web/social media texts; most of these resources are freely available, either from a GitHub/Bitbucket repository, a dedicated web page or upon request. Dataset sizes vary widely, ranging from 500 (DWT) to approximately 6,700 tweets (Pst) for the Twitter treebanks, and from 974 (xUGC) to more than 16,000 sentences (EWT) for the other datasets.

Languages
English is the most represented language; however, some of the resources focus on different English language varieties, such as African-American English (TAAE), Singaporean English (STB) and Hindi-Indian English code-switching data (Hi-En-CS). Three resources are in French (Frb, xUGC, FSMB), one includes code-switching data in French and transliterated dialectal North-African Arabic (NBZ), and two are in Italian (TWRO, Pst); the remaining ones are in Arabic (ATDT), Chinese (CWT), Finnish (TDT), German (tweeDe) and Turkish (ITU). While the current Irish Twitter corpus, which is the source of some of the examples in Table 1, has not yet been converted to treebank format (and as such is not listed in Table 2), its annotation presented most of the same challenges that make up the discussion below (Lynn et al. 2015; Lynn and Scannell 2019).

Data sources
13 out of 21 resources are either partially or entirely made up of Twitter data. Possible reasons for this are the easy retrieval of the data by means of the Twitter API and the use of wrappers for crawling the data, as well as the policy adopted by the platform with regard to the use of data for academic and non-commercial purposes.6

Table 2 Overview of treebanks featuring user-generated content from the web, along with some basic information on the data source, the languages involved and whether they are based on the UD scheme or not. In non-UD treebanks, ‡ and indicate, respectively, a constituency or dependency-based syntactic representation. (AR: Arabic, DZ/FR: Dialectal North-African Arabic/French code-switching, HI/EN: Hindi-English code-switching, AAE: African-American English, MAE: Mainstream American English, IT: Italian, EN: English, FR: French, FI: Finnish, TR: Turkish, DE: German, SgE: Singapore English, ZH: Chinese).

Only three resources include data from social media other than Twitter, specifically Facebook (FSMB), Reddit (GUM) and Sina Weibo (CWT); overall, most of the remaining resources comprise texts from discussion fora of various kinds. Only three treebanks consist of texts from multiple sub-domains: newspaper fora (NBZ); blogs, reviews, emails, newsgroups and question answers (EWT); and Wikinews, Wikivoyage, wikiHow, Wikipedia biographies, interviews, academic writing and Creative Commons fiction (GUM). One resource is made up of generic data automatically crawled from the web (TDT).

Syntactic frameworks
With regard to the formalism adopted to represent the syntactic structure, dependencies are by far the most widely used paradigm, especially among the treebanks created from 2014 onwards, though some resources include both constituency and dependency syntax versions: EWT has manually annotated constituent trees, while GUM contains automatic constituent parses based on output from CoreNLP (Manning et al. 2014) applied to the gold POS tags. As pointed out by Martínez Alonso et al. (2016), dependency-based annotation lends itself well to noisy texts, since it is easier to deal with disfluencies and fragmented text breaking conventional phrase structure rules, which prohibit discontinuous constituents.
The increasing popularity of UD may also play a role in the prevalence of dependencies for web data, considering that 14 out of the 18 dependency treebanks are based on the UD scheme. Although not all of these corpora have been released in the official UD repository, and some of them do not strictly comply with the latest format specifications, the large number of UD resources, as well as their occasional divergences, highlights the need to converge on a single syntactic annotation framework for UGC within UD, to allow for a better degree of comparability across resources and to arrive at tested best practices.
In the next section, we provide an analysis of the guidelines of the surveyed treebanks, highlighting their similarities and differences, and a preliminary classification of the phenomena to be dealt with in UGC data from social media and the web with respect to the standard grammar framework for each language.

Annotation Comparison
To explore the similarities and divergences among the resources summarized in Table 2, we carried out a comparative analysis of recurring annotation choices, taking into account a number of issues whose classification was partially inspired by the list of topics from the Special Track on the Syntactic Analysis of Non-Canonical Language (SPMRL-SANCL 2014).7 These issues include:

- sentential unit of analysis, i.e. whether the relevant unit for syntactic analysis is defined by typical sentence boundaries or by other criteria
- tokenization, i.e. how complex cases of multi-word tokens on the one hand and separated tokens on the other are treated
- domain-specific features, such as hashtags, at-mentions, pictograms and other meta-language tokens

The information on how such phenomena have been dealt with was gathered mostly from the reference papers cited in Table 2, and, whenever possible, by searching for the given phenomena within the resources themselves.

Sentential unit of analysis
Sentence segmentation in written text from traditional sources such as newspapers, books or scientific articles is usually defined by the authors through the use of punctuation. However, this is frequently not the case with UGC on social media.8 Often, punctuation marks may be missing, misapplied relative to the norms of written language, or used for other communicative needs altogether (e.g. emoticons such as :-|, or emoticons simultaneously serving as closing brackets, etc.). In some cases, no punctuation is used whatsoever, as in Example 1 (the non-standard translation and spelling approximate the lack of punctuation in the original German text).
(1) Haben Menschen eigentlich nichts besseres zu tun als Suzie Grime zu haten ja einige Aktionen sind ehrenlos ich habs verstanden
'Don't people have anything better to do than to hate on Suzie Grime yes some things people do are a disgrace I gettit'

Against this background, it is a non-trivial task to segment social media text manually, let alone automatically. Given that many social media posts by private users tend to consist of sequences of short phrases, clauses and fragments, it is understandable that many Twitter resources consider the entire tweet as a basic unit - though for other, longer sources, such as Reddit, using entire posts as utterances by analogy is not feasible. Further, certain types of annotation make it more convenient to retain tweets as single segments. For instance, TWRO analyzed the syntactic/semantic relationships and ironic triggers across different sentences, which was more practical with tweets kept intact. In addition, annotation of inter-sentential code-switching (see Section 4) can be considered more appropriate at the tweet level. Finally, keeping tweets as single units in some treebanks saves the effort needed to develop, maintain, adapt or post-process an automatic sentence segmenter.9

On the other hand, there are counterbalancing considerations that motivate performing medium-independent segmentation on UGC data: first, the possible overuse of syntactic relations that define side-by-side (or run-on) sentences (e.g. parataxis in UD); second, as mentioned previously, at least for some UGC data collections (e.g. blog posts), punctuation occurs frequently enough to be usable. Third, given that Twitter doubled its character limit for posts from 140 to 280 at the end of 2017, treating tweets as single utterances might pose a usability problem for manual annotation. Finally, for NLP tools trained on multiple genres and for transfer learning, inconsistent sentence spans are likely to reduce segmentation and parsing accuracy.
Due to these considerations, tweeDe manually segmented tweets into sentences while introducing an ID system that enables the reconstruction of complete posts, if needed. Similarly, GUM uses utterance-level annotations of user IDs and addressee IDs to indicate the post-tree structure in Reddit forum posts. The CoNLL-U format used in the UD project provides the means to implement these kinds of solutions in a straightforward manner, using utterance-level comment annotations, which are serialized together with each syntax tree. tweeDe, however, still features the use of the parataxis relation within a single utterance for juxtaposed clauses that are not separated by punctuation, even when they form multiple complete sentences, similar to the analysis one would find in newspaper treebanks.
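As a sketch of this mechanism, a tweet segmented into two sentential units might be serialized as follows; the newdoc and sent_id comment keys are standard CoNLL-U conventions, while the tweet-based ID scheme itself is illustrative, not prescribed:

```
# newdoc id = tweet-123456
# sent_id = tweet-123456-1
# text = what a game
1	what	what	DET	_	_	3	det	_	_
2	a	a	DET	_	_	3	det	_	_
3	game	game	NOUN	_	_	0	root	_	_

# sent_id = tweet-123456-2
# text = we won again
1	we	we	PRON	_	_	2	nsubj	_	_
2	won	win	VERB	_	_	0	root	_	_
3	again	again	ADV	_	_	2	advmod	_	_
```

Since the shared prefix of the sent_id values identifies the original post, the complete tweet can be reconstructed from the individual trees when needed.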
For other cases, authors have introduced additional conventions to cover special constructs occurring in social media. For instance, in some treebanks, (sequences of) hashtags and URLs are separated out into 'sentences' of their own whenever they occur at the beginning or end of a tweet and do not have any syntactic function.
A third option besides not segmenting and segmenting manually is, of course, to segment automatically. In the spirit of maintaining a real-world scenario, Frb split their forum data into sentences using NLTK (Bird and Loper 2004), with no post-corrections. Accordingly, the resource contains instances where multiple grammatical sentences are merged into one sentence due to punctuation errors, such as a comma being used instead of a full stop, as in Example 2. Conversely, there are cases where a single sentence is split over multiple lines, resulting in multiple sentences (Example 3) that are not rejoined.
(2) Combofix will start, When it is scanning don't move the mouse cursor inside the box, can cause freezing.
(from Foreebank)

(3) I'm sure the devs. can give you more details on this (from Foreebank)

Tokenization
Tokenization problems in informal text include a wide range of cases which can sometimes require a non-trivial mapping effort to identify the correspondence between syntactic words and tokens. We may thus find multiple words that are merged into a single token, as in contractions10 (Example 4, a phenomenon which is also frequent in spoken English and can also be found in literary texts, but not in newswire or academic writing) and initialisms such as the Italian example in (5), or, conversely, a single syntactic word split up into more than one token (Examples 6-7 below).
(4) gonna ↔ going to

(5) tvtb ↔ ti voglio tanto bene
'I love you so much'

We observed a number of different tokenization strategies adopted to deal with these cases, but most of the time the preferred solution seemed to involve their decomposition (Twb2, xUGC, tweeDe, FSMB, EWT,11 GUM), although a few inconsistencies are found in the resulting lemmatization. Consider the contraction in Example 4: Twb2 reproduces the same lemma as the word form for both tokens (gonna → gon na), while EWT and GUM instead use its normalized counterpart (gonna → go to).
Alternatively, these contractions might be either decomposed and also normalized by mapping their components onto their standard forms, i.e. using 'go' and 'to' as the normalized word forms and lemmas of a multi-token unit12 'gonna' (DWT, ITU,13 MNo), or left completely unsplit as a single token (and lemma) 'gonna' (TAAE, TWRO, Twb, Pst).
How these cases are annotated syntactically is not always specified in the respective papers, but the general principle seems to be that when contractions are split, the annotation is based on the normalized tokenization (Twb2, xUGC, ITU, FSMB, EWT, GUM), while when they are left unsplit, annotation follows the edges connecting words within the phrase's subgraph (TAAE, Pst). According to this principle, Example 4 would thus be annotated according to the main role played by the verb 'go', as shown in Figure 2. As stated above, acronyms and initialisms may also pose a problem for tokenization, but in this case there seems to be a greater consensus on not splitting them up into individual components, especially where an acronym is established and can be assigned a grammatical function without splitting: e.g. 'TL;DR' (too long; didn't read) is left as a single token in GUM, with the reasoning that the form is conventional and likely to be pronounced as the acronym even when read aloud.
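To make the split-and-normalize option for Example 4 concrete, the following CoNLL-U sketch of a constructed sentence keeps gonna as a surface multiword token spanning the syntactic words gon and na, with normalized lemmas go and to (the specific sentence and column values are our illustration, not drawn from any of the treebanks):

```
# text = She is gonna win
1	She	she	PRON	_	_	3	nsubj	_	_
2	is	be	AUX	_	_	3	aux	_	_
3-4	gonna	_	_	_	_	_	_	_	_
3	gon	go	VERB	_	_	0	root	_	_
4	na	to	PART	_	_	5	mark	_	_
5	win	win	VERB	_	_	3	xcomp	_	_
```

The multiword token line (ID range 3-4) preserves the original surface form, while the syntactic annotation operates on the normalized words, so both the raw text and the analysis remain recoverable.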
When the opposite strategy is used, that of multi-token units, the preferred option in most cases is not to merge the separate tokens (TAAE, TWRO, Frb, Twb2, Pst, FSMB, EWT). As a result, one token - either the first (TAAE, TWRO, Frb, Twb2, Pst, EWT, GUM) or the last one (FSMB) - is often promoted to represent the main element of the multi-token unit. This kind of "promotion" strategy, when put into practice, can actually mean very different things. In Frb, a distinction is drawn between morphological splits (Example 6) and simple spelling errors (Example 7):

(6) he should buy anti vir programs ↔ antivir (from Foreebank)

(7) i t keeps causing <ProductName> to lock up ... ↔ it (from Foreebank)

In the first case, both tokens are tagged based on the corresponding category of the intended word, i.e. as a NOUN (since 'antivirus' is a noun), while in the second one the two tokens are treated as a spelling error and an extraneous token, respectively.
In the remaining resources, neither explicit information nor regular/consistent patterns were found concerning the morpho-syntactic treatment of these units. For their syntactic annotation in dependency grammar frameworks, common practice is to attach all remaining tokens to the one that has been promoted to head status. In UD corpora, the second (and subsequent) tokens in such instances are connected to the first token and labeled with the special goeswith relation, which indicates superfluous whitespace between parts of an otherwise single-token word.
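Under this UD convention, the over-splitting in Example 6 could be rendered as in the following sketch, where the first part carries the intended word's lemma, POS and a Typo=Yes feature, and the second part is attached via goeswith with the placeholder UPOS X (the exact feature placement follows our reading of the UD goeswith guidelines):

```
# text = he should buy anti vir programs
1	he	he	PRON	_	_	3	nsubj	_	_
2	should	should	AUX	_	_	3	aux	_	_
3	buy	buy	VERB	_	_	0	root	_	_
4	anti	antivir	NOUN	_	Typo=Yes	6	compound	_	_
5	vir	_	X	_	_	4	goeswith	_	_
6	programs	program	NOUN	_	_	3	obj	_	_
```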
Finally, a distinctive tokenization strategy is adopted in ATDT with respect to at-mentions, in which the '@' symbol is always split apart from the username, whereas other corpora retain the unsplit username along with the '@' symbol.

Other domain-specific issues
This category includes phenomena typical of social media text in general and of Twitter in particular, given that many of the treebanks in this overview contain tweets. Examples include hashtags, at-mentions, emoticons and emojis, retweet markers and URLs. These items operate on a meta-language level and are useful for communicating on a social media platform, e.g. for addressing individual users or for adding a semantic tag to a tweet that helps put short messages into context. On the syntactic level, these tokens are usually not integrated, as illustrated for Italian in Example 8.
(8) RT @user mi sono davvero divertito :D
'RT @user I really had fun :D'
(adapted from Twitter, 2013)

It is, however, also possible for these tokens to fill a syntactic slot in the tweet, as shown in the Turkish example in (9).
(9) #kahvaltı zamanı
'time for #breakfast'
(from Twitter, 2019)

In the different treebanks, we observe a very heterogeneous treatment of these meta-language tokens with regard to their morpho-syntactic annotation. Hashtags and at-mentions, for example, are sometimes treated as nouns (DWT, ITU), as symbols (TWRO, Pst), as elements not classifiable according to existing POS categories or, more generically, as 'other' (Twb2).
Some resources adopt different strategies that do not fit into this pattern: in tweeDe and GUM, for example, at-mentions referring to user names are always considered proper nouns, while hashtags are tagged according to their respective part of speech. Multi-word hashtags are annotated as 'other' in tweeDe (e.g. #WirSindHandball 'We are handball'), but as proper nouns in GUM (#IStandWithAhmed). In Twb2, a different POS tag is assigned to at-mentions when they are used in retweets.
Similarly to hashtags and at-mentions, links can be annotated as symbols (TWRO, Pst), nouns (W2.0, ITU, FSMB), proper nouns (GUM), or 'other' (tweeDe, EWT).14 Emoticons and emojis, on the other hand, are mostly classified as symbols (TWRO, Twb2, tweeDe, Pst, EWT, GUM), less often as interjections (DWT, FSMB), and in one case as a punctuation mark sub-type (ITU). Retweet markers (RT) are considered either nouns (DWT, Pst) or 'other' (Twb2).15 On the syntactic level, these meta-tokens are usually attached to the main predicate, but we also observe other solutions. As stated above, in tweeDe hashtags and URLs at the beginning or end of a tweet form their own sentential units, while in Twb they are not included in the syntactic analysis.
Finally, in cases where meta-tokens are syntactically integrated, the recurring practice is to annotate them according to their role (TAAE, TWRO, DWT, Twb2, tweeDe, Pst, GUM). ATDT is unique in that it does not distinguish between meta-tokens at the beginning or end of the tweet and those that are syntactically integrated in the tweet, but instead always assigns a syntactic function to these tokens.
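For non-integrated meta-tokens attached to the main predicate, one possible CoNLL-U analysis of Example 8 is sketched below; the POS tags (SYM for RT and the emoticon, PROPN for the at-mention) and the use of discourse for all three meta-tokens are our illustrative choices, not the annotation actually found in any single treebank:

```
# text = RT @user mi sono davvero divertito :D
1	RT	RT	SYM	_	_	6	discourse	_	_
2	@user	@user	PROPN	_	_	6	discourse	_	_
3	mi	mi	PRON	_	_	6	expl	_	_
4	sono	essere	AUX	_	_	6	aux	_	_
5	davvero	davvero	ADV	_	_	6	advmod	_	_
6	divertito	divertire	VERB	_	_	0	root	_	_
7	:D	:D	SYM	_	_	6	discourse	_	_
```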
Based on the practices briefly outlined above, in the next section we define an extended inventory of possible annotation issues, some of which occur in only one or a few resources, and propose a set of tentative guidelines for their proper representation within the UD framework.

Towards a Unified Representation
In this section we propose a unified approach to annotating the challenges outlined in Section 3.4, along with other phenomena that are often found in user-generated text, such as code-switching and disfluencies. We take into consideration the pros and cons of different annotation choices, along with the observations we have made across the languages and data sets covered so far in our study.

Sentential unit of analysis
In the interest of maintaining compatibility with treebanks of standard written language, we propose splitting UGC data into sentential units to the extent possible, keeping token sequences undivided only when no clear segmentation is possible. To facilitate tweet-wise annotation if desired, a subtyped parataxis label, such as parataxis:sentence in Figure 3, could be used temporarily during annotation. Since some relation label will be needed to connect multiple sentential units within a tweet in any case, this recommendation is mainly meant to help with later processing or comparison with other data sets, serving as a pointer to identify where the tweet could be split into sentences and distinguishing such junctures from other types of parataxis.
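The proposed device can be sketched on a constructed two-clause tweet as follows; the parataxis:sentence label marks the juncture where the tweet could later be split into two sentences:

```
# text = great game we won again
1	great	great	ADJ	_	_	2	amod	_	_
2	game	game	NOUN	_	_	0	root	_	_
3	we	we	PRON	_	_	4	nsubj	_	_
4	won	win	VERB	_	_	2	parataxis:sentence	_	_
5	again	again	ADV	_	_	4	advmod	_	_
```

Splitting at the subtyped edge would yield two well-formed trees, each rooted in one of the original clause heads, while plain parataxis edges elsewhere in the data remain untouched.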

Tokenization
As shown in the examples in Table 1, user-generated content can include a number of lexical and orthographic variants whose presence has repercussions for segmentation and the choices presented to annotators.
The basic principle adopted in UD, whereby morphological and syntactic annotation is only defined at the word level (universaldependencies.org 2019g), can sometimes clash with the complexity of these cases, which has also been a matter of debate within the UD community.16

- Contractions: One particularly challenging issue for annotation decisions related to tokenization is contraction, i.e. when multiple linguistic tokens are contracted into a single orthographic token (or into fewer tokens than the linguistic content would suggest). It is important to note the different types of contraction that can appear in UGC. For the cases of (i) conventionalized contractions, such as don't, and (ii) erroneously merged words (e.g. mergedwords), it is usually easy to identify the morpheme boundary split point. In these cases, we recommend that annotators split the contraction into its component tokens, in keeping with the UD guidelines (universaldependencies.org 2019a) already in place to deal with occurrences of such merging in standard text. However, for instances of (iii) deliberate informal contractions, such as colloquial abbreviations and initialisms (e.g. EN gonna, wanna, idk 'I don't know') or shorthand forms (FR nimp 'whatever'), standardized criteria are mostly inadequate, or at least insufficient to cover the whole host of possible phenomena. This is due to the ever-changing and often ambiguous nature of user-generated text, i.e.
many of the colloquialisms common in UGC are also increasingly conventionalized in the standard language (e.g.gonna, which is frequent in print in certain registers, and ubiquitous in spoken language), while others may fall out of use entirely.Thus, whether or not a term is considered a conventional contraction is dependent on the time of annotation, and can also be largely subjective.It is also worth noting that increased annotator effort is required if informal contractions are split, as further challenges may be introduced with regard to lemmatization and capturing information for other downstream tasks.This can create a significant overhead in treebank development.For this reason, we advise annotators to adopt an individual approach that takes both treebanking consistency and feasibility into account.
Annotators may wish to consider whether an informal contraction has reached a non-compositional status (e.g. TL;DR, LOL, WTF, idk, etc. in English), and whether it functions solely as a discourse marker or actually bears a semantic and syntactic role within the sentence equivalent to its potential expansion (for example, TL;DR, which means 'too long; didn't read', is often used in online content creation to provide readers with a shortened summary version of a text). In cases where decomposition of a conventionalized expression is avoided, but the function of the whole phrase is equivalent, our suggested approach is in line with the principle proposed in Blodgett et al. (2018), where annotation is carried out according to the root of the subtree of the original phrase. In the example below, the conventionalized form idk (sometimes spelled out when read aloud) is actually used in place of a matrix verb and is therefore labeled as root, taking a complement clause argument ccomp. Some advantages of leaving deliberate, informal contractions unsplit are that less annotation effort is required, consistency within the treebank is easier to maintain, and fewer decisions are left to the discretion of the annotator (such as the intention of the user and the compositionality of the term in specific instances). Additionally, treebank developers may consider this approach a more descriptive, rather than prescriptive, representation of 'noise' in the data. By contrast, the benefits of splitting such tokens are that it can be considered a cleaner approach, since it results in fewer ambiguous tokens, allows for more fine-grained detail in the annotation, and ensures comparability with resources in which the equivalent split forms appear.

- Unconventional Use of Punctuation: We recommend that unconventional use of punctuation in the form of pictograms (:-)) or strings of repeated punctuation marks (!!!!!!!) be annotated as a single token rather than being split. Further, we suggest that strings of emoticons be split so that each individual emoticon is considered an individual token, such as :):) → :) + :) (similar to other sequences of tokens spelled without intervening spaces).

- Over-splitting: Another tokenization issue relates to the treatment of incorrectly split words. The UD guidelines already advise the use of the goeswith relation in cases of erroneously split words in badly edited texts (e.g. be tween, gele bilirim). This means that the split tokens are not merged, but information on their full form is captured nonetheless, while tokens containing whitespace are avoided.
In line with the specifications for erroneously split words (universaldependencies.org 2019a), be it due to formatting, a typo or intentional splitting, we suggest promoting the first part of the word to the role of syntactic head and applying left-to-right attachment, regardless of any potential morphological analysis (i.e. the head of 'be tween' is 'be'). The initial token also bears the lemma, the POS tag and the morphological features of the entire word, while the remaining split parts are only POS-tagged as X, with the lemma and features left unspecified (by convention '_'). For instance, in the Turkish example in Figure 5, the Number and Person features, among others, are expressed in the bilirim part of the over-split word, but annotated in the FEATS column of the first part.
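The two tokenization conventions above can be made concrete with a small sketch that emits CoNLL-U-style rows. This is an illustration under simplified assumptions (only the ID, FORM, LEMMA, UPOS, HEAD and DEPREL fields are shown; the helper functions are invented for this example, not part of any released tool):

```python
# Sketch of the tokenization conventions above in CoNLL-U form:
# a conventional contraction is split under a multiword-token range line,
# while an over-split word keeps both tokens, joined by goeswith, with the
# lemma/UPOS of the whole word on the first part. Remaining CoNLL-U
# columns are elided as "_".

def contraction_rows(form, parts):
    """E.g. don't -> do + n't, kept under a 1-2 range line."""
    rows = [f"1-{len(parts)}\t{form}\t_\t_\t_\t_"]
    for i, (p, lemma, upos) in enumerate(parts, 1):
        rows.append(f"{i}\t{p}\t{lemma}\t{upos}\t_\t_")
    return rows

def oversplit_rows(parts, lemma, upos):
    """E.g. 'be tween': the first part heads the rest via goeswith."""
    rows = [f"1\t{parts[0]}\t{lemma}\t{upos}\t_\t_"]
    for i, p in enumerate(parts[1:], 2):
        rows.append(f"{i}\t{p}\t_\tX\t1\tgoeswith")
    return rows

for row in contraction_rows("don't", [("do", "do", "AUX"), ("n't", "not", "PART")]):
    print(row)
for row in oversplit_rows(["be", "tween"], "between", "ADP"):
    print(row)
```

Note how the over-split case keeps the surface tokens intact while the full word's lemma and UPOS live on the first part, exactly as recommended for 'be tween' above.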

Lemmatization
With respect to the lemmatization of user-generated text, we note that the UD guidelines, specifically those referring to morphology (universaldependencies.org 2019e), can often be applied in a straightforward manner. However, certain phenomena common to UGC can complicate this task. In the cases of contraction, over-splitting and unconventional punctuation, lemmatization will depend on the tokenization approach chosen, as discussed in the previous section. Unconventional uses of punctuation include punctuation reduplication, seemingly random strings of punctuation marks, and pictograms or emoticons created using punctuation marks. Punctuation reduplication can be lemmatized by normalizing where a pattern is observed (?!?!? → ?!); otherwise the lemma should match the surface form (e.g. !!!!!1!!1 → !!!!!1!!1). We also recommend that emoticons and pictograms not be normalized (:]] → :]]), seeing as any attempt at defining a finite set of 'conventional' emoticon lemmas would result in a somewhat arbitrary and incomplete list. When lemmatizing neologisms or non-standard vocabulary such as transliterations, we recommend that any inflection be removed in the lemma column (TR taymlaynda → taymlayn, '(in) timeline'). If the token is uninflected, we suggest the lemma retain the surface form without any normalization (IT tuittare → tuittare, 'to tweet').
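The punctuation-lemmatization heuristic described above can be sketched as follows. The pattern test used here is a simple illustrative assumption (a repeating unit drawn from a small inventory), not a prescribed algorithm:

```python
# Sketch: normalize reduplicated punctuation to its repeating unit
# (?!?!? -> ?!) where a clean pattern is observed; otherwise the lemma
# falls back to the surface form, as recommended above.

def punct_lemma(surface):
    for unit in ("?!", "!?", "!", "?", "."):
        # surface is built only from this unit's characters and starts with it
        if set(surface) == set(unit) and surface.startswith(unit):
            return unit
    return surface

print(punct_lemma("?!?!?"))      # → ?!
print(punct_lemma("!!!!!"))      # → !
print(punct_lemma("!!!!!1!!1"))  # no clean pattern → surface form
```

A fuller implementation would need a richer notion of "pattern", but the fallback-to-surface behaviour is the important part of the recommendation.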

Features Column
UD prescribes the use of the features column to list information about the morphological features of the surface form of a word.We suggest that the feature Abbr=Yes be used for abbreviations such as acronyms, initialisms, character omissions, and contractions.Annotators may also choose to include the feature Style=X, employed by some treebanks to describe various aspects of linguistic style such as [Coll: colloquial, Expr: expressive, Vrnc: vernacular, Slng: slang].Among UGC UD treebanks, only TDT currently uses this feature.
Another useful feature prescribed by UD is Typo=Yes for seemingly accidental deviations from conventional spelling or grammar (used e.g. in GUM, EWT).The feature Foreign=Yes will be further discussed in Section 4.7 on code-switching.

MISC Column
At present, aside from capturing instances of spelling variation arising from abbreviations and typos, UD prescribes no mechanism for describing the nature of spelling variations. For this reason, we suggest the addition of a new attribute to the UD scheme to denote the more general case of non-canonical language and to more accurately describe the nature of the aforementioned phenomena. This additional attribute, NonCan=X, would be annotated in the MISC column with the following possible values: [AutoC: autocorrection, CharOm: character omission, Cont: contraction, Neo: neologism, OS: oversplitting, Phon: phonetization, PuncVar: punctuation variation, SpellVar: spelling variation, Stretch: graphemic stretching, Transl: transliteration, Trunc: truncation].17 Additionally, the MISC column may be used to list values corresponding to a hypothetical standard or full form of the word, i.e. the attributes CorrectForm=X, FullForm=X and CorrectSpaceAfter=Yes may be useful in the cases of non-canonical language, abbreviations and incorrectly merged words, respectively.18 The attribute LangID=X will be further discussed in Section 4.7 on code-switching.
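To illustrate, the proposed attributes could be serialized into the MISC column as follows. Note that NonCan, CorrectForm and FullForm are the attributes suggested in this article, not part of released UD, and the helper below is a toy sketch:

```python
# Sketch: compose MISC attribute=value pairs in CoNLL-U style
# ('_' if there are none). Pairs are sorted for deterministic output.

def misc_field(**attrs):
    return "|".join(f"{k}={v}" for k, v in sorted(attrs.items())) or "_"

# '2morrow' kept as-is, marked as a phonetization with its standard form:
print(misc_field(NonCan="Phon", CorrectForm="tomorrow"))
# → CorrectForm=tomorrow|NonCan=Phon
print(misc_field())  # → _
```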
Domain-specific issues

UGC includes many words and symbols with domain-specific meanings. We recommend treating the various groups as follows:

- Hashtags are to be labeled with the X tag as their Universal POS tag (UPOS). Information about the POS category corresponding to the morpho-syntactic role of the token is often captured in the XPOS column based on language-specific guidelines, e.g. in English, #besties/X/NNS, where NNS designates a plural noun. For languages in which the XPOS guidelines do not express the desired distinction, it is possible to use an attribute FuncPOS in the MISC column to express this information. If a hashtag comprises multiple words, it should be kept untokenized, e.g. #behappy/X. Syntactically integrated hashtags should bear their standard dependencies. Classificatory hashtags at the end of tweets are to be attached to the root with the dependency subtype parataxis:hashtag, as per the English example in Figure 6.

- At-mentions are to be labelled as PROPN. Their syntactic treatment is similar to that of hashtags: when in context they bear the actual syntactic role (see Figure 7 for a Turkish example), otherwise they should be attached to the main predicate with the vocative label, as per the Irish example in Figure 8.
- URLs are to be tagged as SYM, as per the UD guidelines. They are often appended at the end of the tweet without bearing any syntactic function. Across the corpora we explored, such URLs are annotated in diverse ways, with no obvious consensus emerging: parataxis:url vs. discourse:context vs. dep. In cases where they are syntactically integrated in the sentence, we recommend that the XPOS be recorded as NOUN if there are no existing XPOS tags for the language; otherwise the regular XPOS tagging guidelines apply. The token can be attached to its head with the appropriate dependency label, as per Figure 9. We favour using parataxis:url for non-syntactically integrated URLs.

- Pictograms are often used at the end of tweets as discourse markers. In such cases they should be POS-tagged as SYM and attached to the root with the discourse relation. But there are also cases where they replace an actual word (or morpheme) in a syntactic context and may even be inflected (cf. 10-11).

- RTs were originally used with at-mentions so that the Twitter interface would interpret them as retweets, as in Figure 11. In such cases, their UPOS should be SYM, with the dependency label parataxis attaching them to the root. However, they are now more commonly used as an abbreviation for retweet within a tweet; there, the UPOS tag should be NOUN or VERB depending on the syntactic role, and the dependency relation depends on the functional role of the full form (see Figure 12).

- Markup symbols (e.g. <, >, +++) have the UPOS SYM, similar to e.g. math operators in the UD guidelines, and we recommend they be attached to the head with punct, as per the German example in Figure 13.
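The UPOS conventions above for Twitter-specific tokens can be summarized in a small rule table. The regexes are deliberately simplified assumptions (they do not handle Unicode hashtags, trailing punctuation, etc.) and only serve to make the mapping explicit:

```python
import re

# Sketch of the UPOS conventions above for domain-specific tokens:
# hashtags -> X, at-mentions -> PROPN, URLs -> SYM.

RULES = [
    (re.compile(r"^#\w+$"), "X"),            # hashtags, kept untokenized
    (re.compile(r"^@\w+$"), "PROPN"),        # at-mentions
    (re.compile(r"^https?://\S+$"), "SYM"),  # URLs
]

def domain_upos(token):
    for pattern, upos in RULES:
        if pattern.match(token):
            return upos
    return None  # not a domain-specific token; tag as usual

print(domain_upos("#besties"))             # → X
print(domain_upos("@user"))                # → PROPN
print(domain_upos("https://example.com"))  # → SYM
```

The syntactic treatment (parataxis:hashtag, vocative, parataxis:url, etc.) still depends on whether the token is integrated in the sentence, which such surface rules cannot decide on their own.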

Code-switching
As discussed in Section 3, capturing code-switching (CS) in tweets is an additional motivation for following a tweet-based unit of analysis (Çetinoglu 2016; Lynn and Scannell 2019). CS, i.e. switching between languages, is an emerging topic of interest in NLP (Solorio and Liu 2008; Solorio et al. 2014; Bhat et al. 2018) and as such should be captured in treebank data where possible. CS can occur on a number of levels. CS that occurs at the sentence or clause level is referred to as inter-sentential switching (INTER), as shown between English and Irish in Example 13 and between German and Turkish in Example 14. Word-level alternation (MIXED) describes the combination of morphemes from different languages, or the use of inflection according to the rules of one language in a word from another language. This is particularly evident in highly inflected or agglutinative languages. Example 16 shows the creation of a Turkish verb derived from the German noun Kopie 'copy'.
'The guy has 3-4 biographies copied and pasted.' (adapted from Twitter, 2016) While borrowed words can become adopted into a language over time (e.g. cool is used worldwide), when a word is still regarded as foreign in the context of CS, the suggested UPOS is the switched token's POS, if known or meaningful; otherwise X is used (universaldependencies.org 2019d). The morphological feature Foreign=Yes should be used, and we also suggest that the language of code-switched text be captured in the MISC column, along with an indication of the CS type. As such, in Example 15, education would have the MISC values CSType=INTRA|LangID=EN.
In terms of syntactic annotation, the UD guidelines recommend that the flat or flat:foreign label be used to attach all words in a foreign string to the first token of that string (universaldependencies.org 2019c). We recommend that this guideline be followed (for both INTER and INTRA CS) when the grammar of the switched text is not known to annotators (see Figure 14). Otherwise, we recommend applying the appropriate syntactic analysis for the switched language (see Figure 15). Lemmatization of CS tokens can prove difficult if a corpus contains multiple languages that annotators may not be familiar with. To enable more accurate cross-lingual studies, all switched tokens should be (consistently) lemmatized if the language is known to annotators and annotation is feasible within the constraints of a treebank's development phase. Otherwise the surface form should be used, allowing for more comprehensive lemmatization at a later date.
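The fallback attachment described above can be sketched as follows. This is a toy illustration, assuming tokens already carry a per-token language tag (e.g. from the LangID MISC attribute proposed earlier); the example sentence and function are invented for this sketch:

```python
# Sketch: when the grammar of a switched span is unknown, attach every
# non-initial token of the foreign string to the span's first token with
# flat:foreign. Tokens are (id, form, lang) tuples.

def flat_foreign_arcs(tokens, matrix_lang):
    arcs, span_head = [], None
    for tid, _, lang in tokens:
        if lang != matrix_lang:
            if span_head is None:
                span_head = tid       # first foreign token heads the span
            else:
                arcs.append((span_head, tid, "flat:foreign"))
        else:
            span_head = None          # foreign span ended
    return arcs

# Hypothetical French tweet with an embedded English phrase:
tokens = [(1, "Le", "FR"), (2, "happy", "EN"), (3, "hour", "EN"), (4, "commence", "FR")]
print(flat_foreign_arcs(tokens, "FR"))  # → [(2, 3, 'flat:foreign')]
```

The first token of each foreign span would still need a regular attachment into the matrix clause, which requires annotator judgment.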

Disfluencies
Similarly to spoken language, UGC often contains disfluencies such as repetitions, fillers or aborted sentences. This might be surprising, given that UGC does not impose the same pressure on cognitive processing that real-time spoken language production does.
In UGC, however, what may seem to be a production error can in fact have a completely different function (Rehbein 2015).Here, repetitions, self-repair and hesitation markers are often used with humorous intent (Example 17 demonstrates this for German).Disfluencies pose a major challenge for syntactic analysis as they often result in an incomplete structure or in a tree where duplicate lexical fillers compete for the same functional slot.
Some UD treebanks with spoken language material already exist (Lacheret et al. 2014; Dobrovoljc and Nivre 2016; Leung et al. 2016; Øvrelid and Hohle 2016; Caron et al. 2019), and among the treebanks surveyed here, GUM also contains some speech data. The UD guidelines propose an analysis for disfluency repairs (universaldependencies.org 2019f; see Figure 16). This treatment loses information whenever the reparandum does not carry the same grammatical function as the repair, as illustrated in German in Example 18. In this example from Twitter, the user plays with the homonymic forms of the German noun Hengst ('stallion') and the verb hängst ('hang', 2nd person singular). The missing information, however, can easily be added when using the enhanced UD scheme (universaldependencies.org 2019b). Other open questions concern the use of hesitation markers in UGC. We propose to treat them as multi-functional discourse structuring devices and annotate them as discourse markers, attached to the root, a practice already followed for example in GUM for both spoken data and reported speech within UGC.

Discussion
In this final section, we discuss some open questions where the nature of the phenomena described makes their encoding difficult by means of the current UD scheme.

Elliptical structures and missing elements
In constituency-based treebanks that contain canonical texts, such as the Penn Treebank (Marcus et al. 1993), the annotation of empty elements results from the need to keep traces of movement and long-distance dependencies, usually marked with trace tokens and co-indexing at the lexical level, in addition to the actual nodes dominating such empty elements. Dependency syntax frameworks usually do not need such devices, as these syntactic phenomena can be represented with crossing branches, resulting in non-projective trees.
In the specific case of gapping coordination, which can be analyzed as the result of the deletion of a verbal predicate (e.g. John loves_i Mary and Paul (e_i) Virginia), both the subject and the object of the right-hand conjunct are annotated with the orphan relation (Schuster et al. 2017). Even though the Enhanced UD scheme proposes to include a ghost token (Schuster and Manning 2016a), which would be the actual governor of the right-hand conjuncts, nothing is prescribed regarding the treatment of ellipsis without an antecedent. Given the contextual nature of most UGC sources and their space constraints, such cases are very frequent. The problem lies in the interpretation underlying some annotation scenarios. Martínez Alonso et al. (2016) analyzed an example from a French video game chat log where all verbs were elided. Depending on the contextual interpretation of a modifier, a potential analysis could result in two concurrent trees. Such an analysis is not allowed in the current UD scheme, unless the trees are duplicated and one analysis is provided for each of them. Following a French example taken from Martínez Alonso et al. (2016), Figure 17 shows an attachment ambiguity caused by part-of-speech ambiguity and verb ellipsis. A natural ellipsis recovery of the example shown in Figure 17 would read as 'Every time there are 3VS1, and suddenly I have -2 P4'. The token '3VS1' stands for '3 versus 1', namely an uneven combat setting, and 'P4' refers to a Minecraft character's protection armor. The token '-2' allows for more than one analysis. The first analysis is the simple reading as a number, complementing the noun 'P4', in blue in the graph below. A second analysis, in red, treats '-2' as a transcription of moins de ('less of'), which would be the preferred analysis given the interpretation of P4 as an armor level. This example shows the interplay between frequent ellipses, ergographic phenomena and the need for domain knowledge in user-generated data. It also highlights the importance of annotators' choices when facing elided content, as it would have been perfectly acceptable to use the orphan relation to mark the absence of, in this case, a verbal predicate (e.g. orphan(3VS1, fois) and orphan(P4, coup)). The next paragraph illustrates such an analysis on a case from German Twitter. The German example in Figure 18 below illustrates a type of antecedent-less ellipsis that occurs in the spoken commentary of sportscasters, but is also used on Twitter by users who mimic such play-by-play commentary on football games they are watching in real time. As with the French video chat example in Figure 17, it is not clear which verb should be reconstructed in the first elliptical clause, as there is no antecedent in the prior discourse. Given the context (Müller and Lewandowski are well-known football players in Germany, and the preposition auf signals a motion event), it is clear that the first conjunct reports an event where one player passes the ball to another. But the specific manner in which the ball is moved, whether it is 'headed' or 'kicked', could only be determined by watching footage of the game. The example
also illustrates a second verb-less clause that is potentially difficult to recognize: 'TOR' heads its own clause and is coordinated with #Müller rather than being conjoined to #Lewandowski.The relevant clue is the capitalization that evokes the loudness and emphasis of the goal cheer.Again, one cannot be fully confident which verb to reconstruct here: several existential-type verbs are conceivable.Example 19 below illustrates a further variation of the above case in German: the PPs can be iterated to iconically capture a series of passes.Thus, in the example below, Müller is not only the recipient of the ball from James but also the one that passes it on to Lewandowski.However, it is not clear what structure to assume in an enhanced UD analysis that would explicate this.One could assume (i) two explicitly coordinated clauses, (ii) two clauses related by parataxis, or (iii) the use of a relative clause for the second clause.None of these analyses would be obviously right or wrong.
(19) TOR -James auf Müller auf #Lewandowski: die Entscheidung (87.) 'GOAL -James to Müller to #Lewandowski: the clincher (87.)' (from Twitter, 2020) In any event, it is very likely that the growing number of UD treebanks containing user-generated content (and/or spoken language) will be found to feature many constructions that cannot readily be handled based on the existing guidelines for written language.

Limitations
By focusing only on user-generated content at the sentence level, our proposal does not cover phenomena that spread over multiple sentences, which would be relevant at the level of discourse annotation and are seen, for example, in cases of extra-sentential reference. In threaded discussions, similar to dialogue interaction, cases of gapping and, more generally, syntactic ellipsis can occur. These are not covered by our proposal, nor are they permitted in the UD framework, as they would require a more elaborate token indexing scheme spanning sentences.
As in any content expressed in digital (i.e. non-handwritten) media, any conceivable variation of ASCII art can be used and carry meaning (Figure 19). Formatting variations are also observed, such as a recent trend of two-column tweets, shown in Figure 20, where some graphical layout recognition is needed to interpret the two columns as two consecutive sentences. This is similar to challenges in standard text corpora acquired from visual media, such as literary corpora from multi-column pages digitized by Optical Character Recognition (OCR). Cases such as these are not given any specific annotation here, as this proposal refers to the processing of (mostly) text-based interactions.
Another phenomenon that is difficult to annotate lies in the multi-modal nature of most user-generated content platforms, which enable the inclusion of various media (pictures, videos, etc.) that often provide context to a tweet, or meta-linguistic information that changes the whole interpretation of its content. While these phenomena do not change the core syntactic annotation per se, they can change the way that tokens such as proper noun strings or URLs are interpreted.

Interoperability and other frameworks
UD enhancements and modifications

The UD framework is an evolving, ongoing community effort, informed by contributors from wide and varied linguistic backgrounds. One line of changes being discussed among treebank developers concerns the inventory and design of the UD dependency relations. Croft et al. (2017) proposed a redesign of the relation inventory that would reflect four principles from linguistic typology while essentially not changing the topology of UD trees. By contrast, Gerdes et al. (2018) proposed a surface-syntactic annotation scheme called SUD that follows distributional criteria for defining the dependency tree structure and the naming of the syntactic functions. SUD results in changes to the dependency inventory and to tree topology. Along another, more consensual line of exploration, proposals have been made on how to augment UD annotations so as to explicate additional predicate-argument relations between content words that are not captured by the basic surface-syntactic annotations of UD. Schuster and Manning (2016b) formulated five kinds of enhancements, such as propagating the relation that the head of a coordination bears to the other conjuncts. Candito et al. (2017) added the neutralization of diathesis alternations as another kind of enhancement. Given that some of the enhancements require human annotation, Droganova and Zeman (2019) propose to identify a subset of deep annotations that can be derived semi-automatically from surface trees with acceptable quality. The enhanced representation that results from these various proposals forms a directed graph, but not necessarily a tree: it may contain 'null' nodes, multiple incoming edges and even cycles. Note that within the UD community, the enhancements are optional: it is acceptable for a treebank to be annotated with just a subset of the possible enhancements.
In order to highlight the interoperability of our proposed guidelines with other frameworks (such as SUD; Gerdes et al. 2018), Figures 21 and 22 display the same Italian tweet, represented in the UD framework (on the left) and in the SUD framework (on the right). As can be seen, the text contains a copula ellipsis, which is fairly common in news headlines and other kinds of user-generated content. The different approaches of UD and SUD with regard to the election of syntactic heads and their dependents, or the different naming of syntactic functions, pose no problems for the syntactic representation of such a case, thereby demonstrating the interoperability of our proposal.

Treatment of morphology: alignment with UniMorph

Like the UD group, the collaborative UniMorph project (Kirov et al. 2016, 2018; McCarthy et al. 2020) is developing a cross-lingual schema for annotating the morphosyntactic details of language. While UD's main focus is on the annotation of dependency-syntactic relations between tokens in corpora of running text, UniMorph focuses on the analysis of morphological features at the word type level. Nevertheless, UD corpora also include more detailed annotations of lexical and grammatical properties of tokens beyond POS.
The attribute-value pairs used for morphological features in UD's schema are constructed in a bottom-up fashion: features are adapted and newly included as evidence comes in from languages for which resources are created.By contrast, UniMorph is top-down: its design is guided by surveys of the typological literature and aims to be complete, accounting for all attested morphological phenomena.Accordingly, UniMorph provides some attributes that UD lacks, such as those which describe information structure and deixis.In other cases, UniMorph provides more values for certain attributes: for instance, it covers 23 different noun classes used by Bantu languages.
As reported by Kirov et al. (2018), a preliminary survey of UD annotations shows that approximately 68% of UD features have direct UniMorph schema equivalents, with these feature sets covering 97.04% of the complete UD tags.19 As the authors note, some UD features are outside the scope of UniMorph, which marks primarily morphosyntactic and morphosemantic distinctions. For example, UD has markers for abbreviated forms and foreign borrowings, which UniMorph does not provide, as it limits its features to those needed for capturing the meanings of overt inflectional morphemes. Conversely, some UniMorph features are not represented in UD, due to its bottom-up approach.
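Such feature correspondences can be applied mechanically once a mapping table is fixed. The sketch below shows the idea with a handful of standard UD-to-UniMorph pairs; the tiny mapping table is an illustrative assumption, far from the full inventories surveyed by Kirov et al. (2018):

```python
# Sketch: convert a UD FEATS string to a UniMorph-style tag using a small
# (toy) correspondence table; unmapped attribute-value pairs are skipped,
# mirroring the partial coverage discussed above.

UD_TO_UNIMORPH = {
    ("Number", "Sing"): "SG",
    ("Number", "Plur"): "PL",
    ("Tense", "Past"): "PST",
    ("Person", "3"): "3",
}

def convert_feats(ud_feats):
    out = []
    for pair in ud_feats.split("|"):
        attr, _, val = pair.partition("=")
        tag = UD_TO_UNIMORPH.get((attr, val))
        if tag:
            out.append(tag)
    return ";".join(out)

print(convert_feats("Number=Sing|Person=3|Tense=Past"))  # → SG;3;PST
```

Features with no UniMorph counterpart (e.g. Abbr=Yes or Foreign=Yes) would simply drop out of the converted tag, which is exactly the coverage gap noted above.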
While we present recommendations on user-generated content in this article, we extend neither the UPOS tagset nor the morphological features of the UD scheme. In that sense, existing mappings between UD and UniMorph are applicable to social media corpora. Nevertheless, our recommendations for the annotation of UGC go beyond the scope of UniMorph's top-down approach with new types of tokens, e.g. the morphological features of pictograms. As McCarthy et al. (2020) note, UniMorph now recognizes that lemmas and word forms can be segmented; hence, in the case of clitics or agglutinative formations, morphological features can be mapped onto segments of a word form. The resulting policy should be taken into account when aligning the current UD and UniMorph schemes.

Conclusion
In this article we addressed the challenges of annotating user-generated texts from web and social media, proposing, in the context of Universal Dependencies, a unified scheme for their coherent treatment across different languages.
The variety and complexity of UGC-related phenomena sometimes makes their adequate representation by means of an already existing scheme, such as UD, non-trivial. We hope that this proposal will trigger discussions throughout the treebanking community and pave the way for a uniform handling of user-generated content in a dependency framework.

Fig. 1
Fig. 1 Diagram of UGC linguistic phenomena. Our focus is on the non-canonical linguistic phenomena prevalent in UGC which do not yet have standardized annotation guidelines within the UD framework. The elements in boldface are exemplified in Table 1.
Thank you 4 All U do & ing dogs (from Twitter, 2020) These cases are to be annotated with the lemma, UPOS tag and dependency relation of the word they substitute. The French example in Figure 10 demonstrates both cases. The morphological features should reflect the intended meaning. Thus, in Example 12 the feature for the pictogram/verb should be Person=3, even though the form is canonically a non-third-person form. (12) Go follow @IMAPCT_Zodiak he's a beast & he'll follow back. He his followers. (from Twitter, 2020)

Fig. 13
Fig. 13 German example of the use of markup symbols from Twitter, 2020.

(13) "Má tá AON Gaeilge agat, úsáid í! It's Irish Language Week." 'If you have ANY Irish, use it! It's Irish Language Week.' (from Twitter, 2014) (14) "@user Jedem das was er verdient. ;-) Yoksa Köln'den Almanca ögrenmeden mi döndün" 'Everyone gets what they deserve ;-) Or did you return from Cologne without learning German?' (from Twitter, 2014) INTER switching can also be used to describe bilingual tweets where the switched text represents a translation of the previous segment: "Happy St Patrick's Day! La Fhéile Pádraig sona daoibh!" This phenomenon is often seen in the tweets of bi-/multilingual users. CS occurring within a clause or phrase is referred to as intra-sentential switching (INTRA). Example 15 demonstrates INTRA switching between Italian and English: (15) "Le proposte per l'education di Confindustria" 'The proposals for education by Confindustria' (adapted from TWITTIRÒ, 2014)

Fig. 16
Fig. 16 Example tree from the UD guidelines, showing the use of reparandum.
Fig. 17 Problematic example with two contesting structures, from two different readings of the token '-2' surrounded by at least two elided elements (adapted to UD v2.5 from Martínez Alonso et al. 2016).
Italian example from Twitter, 2015, of multiple sentential units in a tweet.

Table 3
Summary of CoNLL-U proposed implementations.