The LUCEA database contains speech recordings of L1 and L2 speakers of English. The speakers are students from an international student community where English is used as the lingua franca. They are recorded longitudinally throughout their 3-year period on campus, producing read and spontaneous speech in their L1 and in L2 English (or in L1 English only). The database is of interest for research and development in linguistics, language education (pronunciation training), speech technology (foreign accent detection, language recognition, speech recognition), and sociophonetics.


The LUCEA database consists of speech recordings of students at University College Utrecht (UCU), the international honours college of Utrecht University, the Netherlands. At UCU, English is the language of instruction and also the social lingua franca. So far, 72 students (cohort 2010) and about 70 students (cohort 2011) have been recruited, as well as 17 exchange students who are native speakers of English. All speaker groups are about 70% female and 30% male. For the 2010 students, 3 recordings are available. For the 2011 students and for the exchange students, 1 recording is available.

Each recording consists of approximately 20 minutes of speech, half of which is read (English texts and sentences, plus an L1 text for the 2011 cohort) and half of which is spontaneous (a 2-minute English monologue on an informal topic, a 2-minute L1 monologue on an informal topic, a 2-minute English monologue on a formal topic, and a 3-minute English conversation with the interviewer). Each recording follows a strict protocol that specifies the exact location of the microphones, the speakers’ instructions, the recording procedure, etc.
The speech recordings have not yet been annotated or transcribed. Transcripts of the read texts are available as PDF. The metadata scheme developed in this project allows annotations to be added later.

The recordings are part of the larger, ongoing LUCEA project, in which the recruited students are followed throughout their 3-year period on the UCU campus. This will result in about 900 recording sessions. At present, about 280 recordings are available [(72+65+55) + (~70) + (17)].
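The bracketed tally can be checked with a short calculation. The sketch below is illustrative only: the group labels are assumptions (the three figures 72, 65 and 55 presumably correspond to the three recording rounds of the 2010 cohort), and the 2011 figure is the approximate value of 70 quoted in the text.

```python
# Recording counts as quoted in the text; the group labels are assumptions.
available = {
    "cohort_2010": [72, 65, 55],  # presumably the three recording rounds
    "cohort_2011": [70],          # "about 70" in the text, so approximate
    "exchange": [17],
}

# Sum the recordings per speaker group, then overall.
per_group = {group: sum(rounds) for group, rounds in available.items()}
total_available = sum(per_group.values())

# "About 900" sessions are expected once the project is complete.
projected_sessions = 900
share = total_available / projected_sessions

print(per_group)
print(total_available)
print(f"{share:.0%} of the projected sessions")
```

With these figures the total comes to 279, which matches the "about 280" stated above and is a little under a third of the projected sessions.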

  • Project leader: Dr Hugo Quené (Utrecht University) 

  • CLARIN center: Max Planck Institute for Psycholinguistics, Nijmegen
  • Website: http://lucea.wp.hum.uu.nl/
  • Tool/Service link: soon on http://corpus1.mpi.nl/
  • Data Link (VLO): http://catalog.clarin.eu/vlo/search?q=D-LUCEA
  • Publications:
    • Orr, R., Quené, H., Beek, R., Diefenbach, T., Leeuwen, D. van, Huijbregts, M. (2011), An international English speech corpus for longitudinal study of accent development. Proceedings of Interspeech 2011, pp. 1889-1892.
    • Orr, R., Huijbregts, M., Beek, R. van, Teunissen, L., Backhouse, K., Leeuwen, D. van (2014), Semi-automatic annotation of the UCU accents speech corpus. Proceedings of LREC 2014, pp. 1483-1487.
    • Quené, H., Orr, R. (2014), Long-term convergence of speech rhythm in L1 and L2 English. Proceedings of Speech Prosody 2014.



