Corpus e Lessico di Frequenza dell'Italiano Scritto (CoLFIS)
Pier Marco Bertinetto°, Cristina Burani*, Alessandro Laudanna^ *, Lucia Marconi#,
Daniela Ratti#, Claudia Rolando# †, and Anna Maria Thornton§
° Scuola Normale Superiore, Pisa
* Istituto di Scienze e Tecnologie della Cognizione, CNR, Roma
^ Università di Salerno
# Istituto di Linguistica Computazionale, Unità Staccata di Genova, CNR, Genova
§ Università de L'Aquila
† in memoriam
The reference corpus consists of excerpts from newspapers (published between 1992 and 1994), magazines, and books, including textbooks and books relating to professional interests. The corpus comprises 3,798,275 lexical occurrences, distributed as follows:
The corpus was designed as the closest approximation to what Italians typically read, as reported in official statistics (ISTAT). It thus mirrors the actual experience of Italian readers.
For a detailed corpus description, see:
Laudanna, A., Thornton, A.M., Brown, G., Burani, C. e Marconi, L. (1995). Un corpus dell'italiano scritto contemporaneo dalla parte del ricevente. In S. Bolasco, L. Lebart e A. Salem (a cura di), III Giornate internazionali di Analisi Statistica dei Dati Testuali. Volume I, pp.103-109. Roma: Cisu
The frequency lexicon consists of two main components: the forms repertoire and the lemmas repertoire.
The forms repertoire lists the frequency of each form found in the corpus, without distinguishing between the different lexical entries it may belong to. For instance, porti counts as a single form, regardless of whether it is interpreted as a noun or a verb (see below).
The lemmas repertoire, by contrast, disambiguates all identical forms belonging to different lemmas. For instance, porti is listed both as the plural of porto 'harbor' and as the second person singular present indicative of portare 'to bring'. In addition, the lemmas repertoire treats syntagmatic words as single entries. By 'syntagmatic words' we mean complex locutions consisting of two or more words, whose meaning is often independent of the individual components. Examples are Divina Commedia '(Dante's) Divine Comedy', gamba di tavolo 'table leg', a causa di 'due to', and spesse volte 'often'.
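The difference between the two repertoires can be illustrated with a minimal sketch. The token triples below are hypothetical toy data, not CoLFIS records, and the function names and part-of-speech labels are assumptions made only for this illustration; the actual lexicon files have their own layout.

```python
from collections import Counter

# Hypothetical toy data: one (form, lemma, part of speech) triple per token.
# These are NOT real CoLFIS records or counts.
tokens = [
    ("porti", "porto", "noun"),
    ("porti", "portare", "verb"),
    ("porti", "portare", "verb"),
    ("porto", "porto", "noun"),
    ("a causa di", "a causa di", "locution"),  # syntagmatic word, one entry
]

def forms_repertoire(tokens):
    """Count occurrences per surface form, ignoring lemma and part of speech."""
    return Counter(form for form, _lemma, _pos in tokens)

def lemmas_repertoire(tokens):
    """Count occurrences per (lemma, part of speech), separating homographs."""
    return Counter((lemma, pos) for _form, lemma, pos in tokens)

forms = forms_repertoire(tokens)
lemmas = lemmas_repertoire(tokens)

print(forms["porti"])                # noun and verb readings counted together
print(lemmas[("porto", "noun")])     # only the noun occurrences
print(lemmas[("portare", "verb")])   # only the verb occurrences
```

In this toy example the form porti receives a single count in the forms repertoire, while the lemmas repertoire splits the same occurrences between porto and portare.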
The advantages of CoLFIS over previous Italian frequency lists may be summed up as follows:
- Careful corpus balancing. This ensures that the quantitative information is not fortuitous;
- Corpus size. Although modern computational technology provides efficient automatic lemmatization tools, few lexical resources of comparable size have had the output of automatic processing systematically checked by human operators. This check makes the lemmatization more trustworthy, especially for syntagmatic words.
The frequency lexicon files are available for free download:
The corpus (limited to the authorized portion) is available at: www.ge.ilc.cnr.it/strumenti.php.
This work was funded by the CNR (Consiglio Nazionale delle Ricerche) *, which has long provided invaluable support to scientific research in Italy. With the help of willing users, we hope to enrich this resource with further facilities.
All files may be downloaded in three different formats: .txt (text only), .mdb (Microsoft Access) and .dbf (accessible with FileMaker by Mac and PC users, or with dBase in a DOS environment).
.txt files may be imported into any database program. As for .mdb files, the most efficient way to access them is with Microsoft Access 97 or a later version (Microsoft Excel cannot handle files this large and may lose data).
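Because the files are too large for a spreadsheet, one alternative is to stream a .txt file row by row in a script. The sketch below is only an illustration under stated assumptions: it supposes a tab-delimited file with a header row and columns named form and frequency, none of which is confirmed by this page, and the sample values are invented. Adjust the delimiter and column names to the actual file layout.

```python
import csv
import heapq
import io

def top_forms(lines, n=3, delimiter="\t", form_col="form", freq_col="frequency"):
    """Stream delimited rows and return the n most frequent (form, count) pairs.

    The delimiter and column names are assumptions, not the documented
    CoLFIS layout. Rows are read one at a time, so the whole file never
    has to fit in memory.
    """
    reader = csv.DictReader(lines, delimiter=delimiter)
    pairs = ((row[form_col], int(row[freq_col])) for row in reader)
    return heapq.nlargest(n, pairs, key=lambda pair: pair[1])

# Invented sample standing in for a real file; the counts are toy values.
sample = "form\tfrequency\ndi\t100\nporti\t7\nche\t50\n"
result = top_forms(io.StringIO(sample), n=2)
print(result)
```

With a real file, the in-memory sample would be replaced by an open file object, for example `open(path, encoding=..., newline="")`, where both the path and the encoding depend on the downloaded file.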
The CoLFIS presentation and the introduction to browsing the files in the .txt, .mdb and .dbf formats were prepared by Pasquale Rinaldi and Cristina Burani (Istituto di Scienze e Tecnologie della Cognizione, CNR, Roma).
The examples of Access queries were prepared by Pasquale Rinaldi (Istituto di Scienze e Tecnologie della Cognizione, CNR, Roma).
The examples of FileMaker queries were prepared by Maddalena Agonigi (Scuola Normale Superiore, Pisa).
* CNR - National Committees "Information Science and Technology" and "Historical, Philosophical and Philological Sciences". Project grant: Lexical Data Base of Contemporary Written Italian, to Pier Marco Bertinetto, Cristina Burani, and Lucia Marconi.