Language Resources and Evaluation, Volume 44, Issue 3, pp 281–290

Google the verb

DOI: 10.1007/s10579-010-9117-9

Cite this article as:
Kilgarriff, A. Lang Resources & Evaluation (2010) 44: 281. doi:10.1007/s10579-010-9117-9

Abstract

The verb google is intriguing for the study of morphology, loanwords, assimilation, language contrast and neologisms. We present data for it for nineteen languages from nine language families.

Keywords

Multilingual morphology; Loanwords; Assimilation; Neologism

1 The case

There are several reasons why the verb google is an appealing object for linguistic research.
  • It exists in many languages, with the same core meaning. (For most words it does not make sense to say that the same word exists in many languages. However, names and technical terms can be language-independent, and for google it does seem to make sense to say that the ‘same’ verb exists in many languages.)

  • It is new: it has not had time to develop idiosyncratic morphological, phonological or syntactic behaviour. Like the invented words used in psycholinguistic experiments, it therefore allows us to view the default behaviour for each language.

  • Unlike invented words, it is common and can be explored using corpus methods.

  • Most new words are nouns, but verbs tend to show more morphological and syntactic complexity, and so support a wider range of research questions.

  • For English, google is phonetically and orthographically an unexceptional word which readily adopts standard inflections and other kinds of linguistic variation in speech and in writing. (This does not apply to Yahoo!, in speech or in writing.) We think this will be fairly true for google in at least some other languages, though that is an outcome of the research rather than an input to it.

  • As a search term, google works well and is easily searched for, in all of its variant forms, in most of the languages we have investigated.

In our corpus query tool, the Sketch Engine, we have general, recent web corpora for a number of languages, gathered as described in Baroni et al. (2009), Sharoff (2006), and Kilgarriff et al. (2009). In the tool we can conveniently search for all forms of the verb and compute their frequencies per million, so we did this wherever we had a suitable corpus. In other cases, we used a commercial search engine.

2 The data

2.1 Germanic languages

 

Dutch (NlWaC, 128m)

Forms | Grammatical label | Frequency
google | 1sg, n | 670
googlen, googelen, googleen, google-en, goegelen, google’n | inf, 1,2,3 pl, n | 55
googled, googelt, googlet | 2, 3 sg, n | 16
googelde, googlede | past sg | 2
gegoogled, gegooglet, gegoogeld, gegoogelt, gegoogle’t | pastpart | 37
Total | 6.7 pm | 862

English (UKWaC, 1,527m)

Forms | Grammatical label | Frequency
google | base, n | 2488
googling, googleing | prespart, gerund | 243
googled | past, pastpart | 178
googles | 3 sg, n pl | 22
Total | 1.98 pm | 3031

German (DeWaC, 1,627m)

Forms | Grammatical label | Frequency
google, googel, googl, googele | 1 sg | 1395
googlen, googln, googeln, googleln, gugeln | infin, 1,3 pl | 681
gegooglet, gegoogled, gegugelt, gegoogl, gegoogelt | pastpart, 3 sg, 2 pl | 480
googlet, googled, googelt | 3 sg, 2 pl | 105
googelte, googlete | past 1 sg, 3 sg | 10
googlest, googelst | 2 sg | 39
gegoogelte, googelnde | pastpart adj f sg | 5
gegoogelten | pastpart adj pl | 2
ergoogle | 1 sg | 1
ergooglen, ergoogeln, ergugeln | infin, 1 pl, 3 pl | 51
ergoogelt, ergooglt, ergooglet | pastpart, 3 sg, 2 pl | 51
ergoogelte | past 1 sg, 3 sg | 7
ergoogeltes | pastpart adj neuter | 2
ergoogled | 3 sg, 2 pl | 1
Total | 0.315 pm | 513

Norwegian (Newspaper, 788m)

Forms | Grammatical label | Frequency
google | infin | 259
googler | present | 99
googlet, googla | past, pastpart | 54
googles | passive | 3
googlede | pastpart def | 1
googlende | prespart | 1
Total | 0.52 pm | 417

Swedish (Informal web, 18m)

Forms | Grammatical label | Frequency
googla | infin | 23
googlar | pres | 11
googlade | past | 6
googlat | supine | 13
googlande | prespart | 5
Total | 3.2 pm | 58

Notes for data in all tables:

• Inclusion

– variants for the same item in the verbal paradigm are comma-separated

– only verb forms are included, although counts include nouns as well where the same form can be noun or verb. In these cases the noun option is indicated after a semi-colon

– derivational morphology is not included, except where noted below

• order: forms listed in frequency order, or, where that disguises the structure of the paradigm, standard paradigm order

• normalisation: all Latin-alphabet characters normalised to lowercase, except where uppercase indicated a name or a noun; those cases were excluded

• corpus name is given where this has been used in publications or on the Sketch Engine website; in other cases we give a minimal description of the corpus type, or a note of the search engine used for direct web-searching

• the naming of grammatical roles cannot be done with precision where space is limited and the data covers a wide range of languages, and this is in any case marginal to the paper. Grammatical labels are indicative only. Where no tense is given, tense is present; where no mood is given, mood is indicative. A comma indicates syncretism: the form realises multiple grammatical roles

• Frequencies per million (for the verb as a whole) are given in most cases where the corpus size is known, in an attempt to make it possible to compare behaviour between languages. However, these figures are to be viewed with great caution, not only because the corpora differ in a wide variety of ways, but also because the noun is always far more common than the verb, and in some cases the overall count given will include many noun cases which could not reliably be distinguished from verbal ones
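The per-million normalisation mentioned in the last note can be sketched as follows. This is a minimal illustration, not the Sketch Engine's own computation: the function name `per_million` is hypothetical, and the corpus sizes (given as "18m", "120m" and so on) are assumed here to be counts of tokens in millions.

```python
def per_million(count: int, corpus_size: int) -> float:
    """Normalise a raw count to occurrences per million corpus tokens."""
    if corpus_size <= 0:
        raise ValueError("corpus_size must be positive")
    return count * 1_000_000 / corpus_size


# Totals and corpus sizes as reported in the Swedish and Welsh tables
# (assumption: "18m" and "120m" mean 18 and 120 million tokens).
swedish = per_million(58, 18_000_000)
welsh = per_million(216, 120_000_000)
print(f"Swedish: {swedish:.1f} pm, Welsh: {welsh:.2f} pm")
# prints "Swedish: 3.2 pm, Welsh: 1.80 pm"
```

The normalisation only corrects for corpus size; it does nothing about the noun/verb ambiguity or the differences in corpus composition that the note warns about.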

Dutch and German show a large number of spelling variants. Amongst other things, in Dutch and German spelling the le ending is not standard. Some authors have retained it, others have changed it to el, others have deleted the e altogether, and a couple of authors have covered all bases, with an l in both possible places: googleln. Frequencies for Dutch and English cannot be compared with others because of syncretism between the verb and the much more common noun. The high frequency (per million) in the Swedish corpus, which was collected explicitly to explore informal language, is noteworthy, though based on low numbers.
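To search for all such variants at once, the spelling alternations can be folded into a single pattern. The sketch below is an illustration only, using Python regular expressions rather than the corpus query syntax actually used for the paper. It covers the five German infinitive spellings listed in the table; the names `GERMAN_INFINITIVE` and `is_infinitive_variant` are hypothetical.

```python
import re

# Stem vowel written "oo" or "u"; ending kept as -le, reversed to -el,
# reduced to -l, or written with an l in both places (-lel); then the -n suffix.
GERMAN_INFINITIVE = re.compile(r"g(?:oo|u)g(?:lel|le|el|l)n")

# The five spellings attested in the German table.
ATTESTED = ["googlen", "googln", "googeln", "googleln", "gugeln"]


def is_infinitive_variant(token: str) -> bool:
    """True if an already-normalised token matches the infinitive pattern."""
    return GERMAN_INFINITIVE.fullmatch(token) is not None


for form in ATTESTED:
    print(form, is_infinitive_variant(form))
```

The pattern deliberately overgenerates: it would also accept unattested combinations such as gugln, so hits from a pattern like this still need to be inspected by hand.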

We have included German ergooglen, a derived verb where the prefix means ‘creative process’. This was a common variant on the base verb with an aspectual meaning contrast: see also notes on Slav languages and Chinese below. Other prefixed forms are not included in the table: the second most frequent was rumgooglen, a contraction of herumgooglen meaning “google around”, which always occurred in collocation with a quantity expression, usually ein bisschen rumgooglen, “google around a bit”.

2.2 Romance languages

 

Italian (ItWaC, 1,909m)

Forms | Grammatical label | Frequency
googlare | infin | 29
googlato | pastpart | 27
googlando | gerund | 26
googlate | imper pl, n pl | 18
googla | imper sg, 3 sg | 8
googlo | 1 sg | 3
googlò | past | 1
googlasse | subj, 3 sg | 1
Total | 0.059 pm | 114

Romanian (Web via Google)

Forms | Grammatical label | Frequency
googăli, gugăli | infin | 7210
googălesc, gugălesc | 1 sg, 3 pl | 6780
googăleşti, gugăleşti | 2 sg | 4670
googăleşte, gugăleşte | 3 sg, imper sg | 6500
googălim, gugălim | 1 pl | 1387
googăliţi, gugăliţi | 2 pl, imper pl | 1804
googălit, gugălit | pastpart, future | 20,430
googăleam, gugăleam | past cont 1 sg | 514
googăleai, gugăleai | past cont 2 sg | 10
googălea, gugălea | past cont 3 sg | 5
googăleaţi | past cont 2 pl | 1

Spanish (Internet Es, 117m)

Forms | Grammatical label | Frequency
googleando | gerund | 11
googlear | infin | 8
googleo | 1 sg | 1
googleas | 2 sg | 1
googleadme | imper + pronoun | 1
Total | 0.19 pm | 22

In Spanish and many other languages, pronouns are sometimes written attached to the verb, as in googleadme, which is included to illustrate the issue and because, after detaching the pronoun, the remaining form is the only imperative found for Spanish.
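Detaching an attached pronoun before counting can be sketched roughly as below. This is an assumption-laden illustration, not the procedure used for the paper: `detach_clitic`, the clitic list and the set of host endings are all hypothetical, and the host endings cover only the infinitive, gerund and plural imperative of an -ar verb such as googlear.

```python
# Spanish enclitic pronouns, longer ones first.
CLITICS = ("nos", "los", "las", "les", "me", "te", "se", "lo", "la", "le", "os")

# Assumed host endings for an -ar verb: infinitive, gerund, plural imperative.
HOST_ENDINGS = ("ar", "ando", "ad")


def detach_clitic(token: str):
    """Return (verb_form, pronoun); pronoun is None when nothing is detached."""
    for clitic in CLITICS:
        if token.endswith(clitic):
            host = token[: -len(clitic)]
            # Only split when the remainder looks like a form that can host a clitic.
            if host.endswith(HOST_ENDINGS):
                return host, clitic
    return token, None


print(detach_clitic("googleadme"))  # ('googlead', 'me')
print(detach_clitic("googleas"))    # ('googleas', None)
```

The check on the host ending is what keeps an ordinary inflected form from being split just because it happens to end in a string that looks like a pronoun.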

2.3 Slav languages

 

Czech (Web crawl, 800m)

Forms | Grammatical label | Frequency
googlen | passive | 1
progooglovat | “google through” infin | 1
progoogluj | “google through” imper | 1
vygooglovat | “find by google” | 1
Total | 0.005 pm | 4

Russian (Web crawl, 188m)

Forms | Grammatical label | Frequency
погуглите | imper pl | 6
погуглил, нагуглил | past 3 sg m | 3
погуглила | past 3 sg f | 2
гуглить | infin imperf | 2
гуглю | 1 sg | 2
погуглить, нагуглить | infin perf | 2
гуглят | 3 pl | 1
погуглив | past gerund | 1
прогугли | imper sg | 1
Total | 0.106 pm | 20

Slovak (SNK 4.0, 526m)

Forms | Grammatical label | Frequency
googlovať | infin | 7
googlujú | 3 pl | 1
googluj | imper 3 sg | 1
gúgli | imper 3 sg | 1
gúgliť | infin imperf | 1
nagoogliť | infin perf | 1
pogooglovať | infin | 1
pregooglujú | 3 pl | 1
negooglovali | past 3 pl neg | 1
vygoogliť | infin perf | 2
vygooglite | 2 pl | 1
vygoogli | imper 3 sg | 1
vygooglených | pastpart gen pl | 1
vygooglené | pastpart nom pl | 1
vygooglim | 1 sg | 1
vygooglovať | infin | 2
vygooglujem | 1 sg | 1
vygooglovaná | pastpart nom f | 1
vygooglovali | past 3 pl | 1
vygooglovala | past 3 sg f | 1
vygooglujeme | 1 pl | 1
vygúglená | pastpart nom f | 1
vygúgli | imper 3 sg | 1
vygúglili | past 3 pl | 1
zagúglite | 2 pl | 1
Total | 0.063 pm | 33

Slovene (FidaPLUS, 620m)

Forms | Grammatical label | Frequency
guglanje, googlanje | gerund | 8
poguglati, pogooglati | infin | 7
guglati, googlati | infin | 6
prigooglati | infin | 4
Total | 0.040 pm | 25

Amongst the Slav languages we have included verb forms with prefixes relating to aspect. While they are usually treated as derivational morphology, aspect is often conveyed by inflectional and other grammatical means in other languages so they have been included here.

We are struck by the very low frequencies for Czech. This may be because this particular corpus includes more formal data than some others (compare the Swedish, which is informal by design); because Seznam, not Google, is the leading search engine in the Czech Republic; or for more linguistic reasons: perhaps Czech is not a language that forms verbs so readily.

2.4 Celtic languages

 

Irish (Web via Google)

Forms | Grammatical label | Frequency
googláil, gúgláil, ghoogláil | gerund | 36
ghoogláil, ghúgláil | infin | 25
googlóidh | future | 2
googlaigh, gúgal | imperative | 2
ghooglaigh | past | 1
gúgaláilte | verbal adj | 1

Welsh (Web crawl, 120m)

Forms | Grammatical label | Frequency
gwglo, googlo, googlio, gwglio | base v, n | 207
gwglwyd | impers perf | 4
gwglwch, googlwch | imp pl, 2 pl | 2
googlia, gwglia | imp sg | 2
gwglais | 1 sg perf | 1
Total | 1.80 pm | 216

The Welsh derived forms included gwglbomio, ‘googlebombing’.

2.5 Greek

 

Greek (GkWaC, 149m)

Forms | Grammatical label | Frequency
γκουγκλάρω, γκουγκλίζω | 1 sg | 17
γκούγκλιζες | past cont, 2 sg | 1
γκουγκλάρουμε | 1 pl | 1
γκούγκλισα | past 1 sg | 5
γκουγκλάρουν | 3 pl | 1
γκουγκλίσει | subj, 3 sg | 1
googlάρεις | 2 sg | 2
γκουγκλάροντας, googlίζοντας | gerund | 4
γκούγκλαρα, googlαρα, γκούγκλιζα | past cont, 1 sg | 7
γκουγκλίστε | imper 2 pl | 1
Total | 0.26 pm | 40

2.6 Asian languages

[Table: data for the Asian languages, available only as an image: https://static-content.springer.com/image/art%3A10.1007%2Fs10579-010-9117-9/MediaObjects/10579_2010_9117_Figa_HTML.gif]

The Asian languages covered raise a number of additional issues. Both Persian and Telugu are languages which make extensive and systematic use of light verb constructions, so the verb google usually translates as something like the compound verb do google.

Chinese has no inflectional morphology and a weaker noun/verb distinction than many languages. It has a writing system without spaces between words and a correspondingly weaker distinction between words and multi-word units. It also presents challenges when one wishes to write a word that one has not seen written before. Aspect markers are the indicators of verb-hood, and here we present the stem (google in Latin or 谷歌, the Chinese-writing name of the company) + aspect markers.

In many languages there is an unresolved tension between English-like and localised orthography. It applies to, inter alia, the choice of character set (in Chinese and Greek) and the orthographic realisation of the vowel group: English oo is not native to many orthographies, and in most cases the alternative is u, while in Welsh it is w.

3 Conclusion

We present a data set for the verb google across many languages. It presents an interesting testing-ground for a range of ideas on morphology, loanwords, assimilation, language contrast and neologisms. We hope it will stimulate further thinking in these areas.

Acknowledgments

With thanks to Serge Sharoff and the Bologna group for permission to use their corpora in the Sketch Engine. For specific language expertise I would like to thank: Gisle Andersen, PVS Avinesh, Núria Bel, Vladimir Benko, Sebastian Burghof, Eugenie Giesbrecht, Andrew Hawke, Abhilash Inumella, Håkan Jansson, Vojtěch Kovář, Simon Krek, Monica Macoveiciuc, Mavina Pantazara, Behrang QasemiZadeh, Siva Reddy, Bettina Richter, Pavel Rychlý, Marina Santini, Simon Smith, Elaine Uí Dhonnchadha, and Carole Tiberius.

Copyright information

© Springer Science+Business Media B.V. 2010