The Sources of the BDPA
The table below gives an overview of all the sources in the BDHL database. Click on the links in the table to view a more detailed description of each source. Each detailed description includes information on the original source from which the data was taken, a link to the original data source, the name(s) of the person(s) who compiled and aligned the data, and a link where the manually edited pairwise and multiple sequence alignments for a given source can be downloaded (along with the unaligned input files).
Dataset | Languages | PSA (number of pairwise alignments) | MSA (number of multiple alignments) | NoW (number of words) | NoL (number of languages) | PID (average percentage identity) |
---|---|---|---|---|---|---|
Andean | Andean languages (Aymara, Quechua) | 619 | 76 | 883 | 20 | 55 |
Bai | Bai dialects | 889 | 90 | 1416 | 17 | 32 |
Bulgarian | Bulgarian dialects | 1595 | 152 | 32418 | 197 | 48 |
Dutch | Dutch dialects | 500 | 50 | 3024 | 62 | 44 |
Dutch_RND | Dutch dialects | 1651 | 166 | 50046 | 363 | 45 |
French | French dialects | 712 | 76 | 3810 | 62 | 41 |
Germanic | Germanic languages and dialects | 1110 | 111 | 4775 | 45 | 32 |
Japanese | Japanese dialects | 219 | 26 | 224 | 10 | 40 |
Norwegian | Norwegian dialects | 501 | 51 | 2183 | 51 | 46 |
Ob-Ugrian | Uralic languages | 444 | 48 | 689 | 21 | 45 |
Romance | Romance languages | 297 | 30 | 240 | 8 | 37 |
Sinitic | Chinese dialects | 200 | 20 | 20 | 40 | 35 |
Slavic | Slavic languages | 120 | 20 | 100 | 5 | 38 |
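The PID column reports an average percentage identity, but this page does not spell out how it is computed. The sketch below shows one common way to score a pairwise alignment, assuming that identity is the share of alignment columns in which both sequences have the same segment and that the average is taken over all pairwise alignments of a dataset. The function names, the gap symbol `-`, and the toy alignment are illustrative assumptions and are not taken from the database itself.

```python
def percentage_identity(seq_a, seq_b):
    """Percentage of alignment columns in which both segments are identical.

    Assumption: both sequences are already aligned (equal length) and gaps
    are written as "-". A column that pairs a segment with a gap counts as
    a mismatch.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have the same length")
    if not seq_a:
        return 0.0
    matches = sum(1 for a, b in zip(seq_a, seq_b) if a == b)
    return 100.0 * matches / len(seq_a)


def average_pid(alignments):
    """Mean percentage identity over a list of (seq_a, seq_b) pairs."""
    scores = [percentage_identity(a, b) for a, b in alignments]
    return sum(scores) / len(scores)


# Hypothetical toy alignment, not taken from any of the datasets.
word_a = ["w", "o", "l", "d", "e", "m", "o", "r", "t"]
word_b = ["w", "a", "l", "d", "e", "m", "a", "r", "-"]
print(round(percentage_identity(word_a, word_b), 2))
```

Under this definition, a low PID indicates that the aligned words in a dataset differ in many positions. Other definitions, for example ones that divide by the length of the shorter sequence, would give different values.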
Andean
The data comes from the Sounds of the Andean Languages project (2001-2004) at the University of Sheffield, UK. The main contributor was Paul Heggarty, who conducted the fieldwork and provided the sound recordings and phonetic transcriptions, which are available on the project website. For our benchmark database, we used 76 cognate sets distributed over 21 varieties of Quechua and Aymara.
Original Data: | http://www.quechua.org.uk/ |
---|---|
Compiled by: | Paul Heggarty |
Aligned by: | J.-M. List |
Download: | andean.zip |
Cite as: | Heggarty, P. (2006): Sounds of the Andean languages. URL: http://www.quechua.org.uk. |
Bai
The data on the Bai dialects is a compilation of two independently conducted studies by Wang (2006) and Allen (2007). From these sources, 90 cognate sets distributed over 17 language varieties were extracted.
Original Data: | Wang (2006), Allen (2007) |
---|---|
Aligned by: | J.-M. List |
Download: | bai.zip |
Cite source as: | Wang, F. (2006): Comparison of languages in contact. The distillation method and the case of Bai. Taipei: Institute of Academia Sinica. Allen, B. (2007): Bai dialect survey. SIL International. URL: http://www.sil.org/silesr/2007/silesr2007-012.pdf. |
Bulgarian
The Bulgarian dialect data comes from the Buldialect project (Buldialect - Measuring Linguistic Unity and Diversity in Europe, 2006-2010), a joint project between the Eberhard-Karls University Tübingen, the University of Groningen, and the Bulgarian Academy of Sciences. The Bulgarian data in the BDHL contains the phonetic transcriptions of 152 words collected at 197 sites distributed all over Bulgaria. The data was collected in such a way that it represents the most important phonetic features described in the traditional literature on Bulgarian dialects.
Original Data: | http://www.jelenaprokic.eu/buldialect |
---|---|
Aligned by: | J. Prokić |
Download: | bulgarian.zip |
Cite as: | Prokić, J.; Nerbonne, J.; Zhobov, V.; Osenova, P.; Simov, K.; Zastrow, T. and E. Hinrichs (2009): "The computational analysis of Bulgarian dialect pronunciation". Serdica Journal of Computing 3.3:269-298. |
Dutch
The Dutch dialect data comes from the Goeman-Taeldeman-Van Reenen project (GTRP, 1980-1995) at the Meertens Institute in the Netherlands. Our benchmark data contains aligned transcriptions of 50 words from the GTRP collected at 62 places.
Original Data: | http://www.meertens.knaw.nl/mand/database/ |
---|---|
Aligned by: | J.-M. List |
Download: | dutch.zip |
Cite as: | de Schutter, G., B. van den Berg, T. Goeman, and T. de Jong (2007): MAND. Morfologische Atlas van de Nederlandse Dialecten. URL: http://www.meertens.knaw.nl/mand/database/ |
French
The data on French dialects comprises a small excerpt of 76 cognate sets distributed over 60 dialect varieties taken from the "Tableaux phonétiques des patois suisses romands". The data was partially digitized at the Heinrich Heine University Düsseldorf under the supervision of Hans Geisler. Our collection is based on a simplified conversion of the original phonetic orthography into IPA.
Original Data: | Gauchat et al. (1925) |
---|---|
Digitized by: | H. Geisler |
Aligned by: | J.-M. List |
Download: | french.zip |
Cite source as: | Gauchat, L. et al. (1925): Tableaux phonétiques des patois suisses romands. Neuchâtel: Attinger. |
Germanic
The data was taken from the Languages & Origins in Europe project at the McDonald Institute for Archaeological Research (2006-2009). The project website offers sound files and phonetic transcriptions of different European language families (Germanic, Romance, Slavic). For our benchmark database, we used 111 cognate sets distributed over 45 language varieties (English dialects and Germanic languages) in an IPA transcription that was slightly simplified compared to the original.
Original Data: | http://www.languagesandpeoples.com/ |
---|---|
Compiled by: | P. Heggarty |
Aligned by: | J.-M. List |
Download: | germanic.zip |
Cite source as: | Heggarty, P. (2007): Languages and Origins in Europe. URL: http://www.languagesandpeoples.com/. |
Japanese
The data was taken from Shirō (1973) and digitized by three students from Heinrich Heine University Düsseldorf during a course on historical linguistics. Here, the original phonetic transcriptions are given in a slightly modified form that conforms to the IPA standard.
Original Data: | Shirō (1973) |
---|---|
Digitized by: | M. Dickmanns, S. M. Oetzel, and K. Vogt |
Aligned by: | J.-M. List |
Download: | japanese.zip |
Cite source as: | Shirō, H. (1973): Japanese dialects. In: Diachronic, areal and typological linguistics. Edited by H. M. Hoenigswald and R. H. Longacre. 368-400. |
Norwegian
The Norwegian dialect data comes from a project conducted at the Department of Linguistics of the Norwegian University of Science and Technology (NTNU) in Trondheim, in which dialect speakers were asked to read the fable "The North Wind and the Sun" in their native dialect. All recordings were phonetically transcribed and can be found at http://www.ling.hf.ntnu.no/nos/. Our benchmark database contains transcriptions of 51 manually extracted cognates recorded from 51 speakers.
Original Data: | http://www.ling.hf.ntnu.no/nos/ |
---|---|
Aligned by: | J.-M. List |
Download: | norwegian.zip |
Cite source as: | Almberg, J. and Skarbø, K. (2011): Nordavinden og sola. En norsk dialektprøvedatabase på nettet. URL: http://www.ling.hf.ntnu.no/nos/ |
Ob-Ugrian
The data in this subset of the benchmark database comes from the Global Lexicostatistical Database. It consists of 48 cognate sets distributed over 21 Ob-Ugrian (Uralic) languages. The data was digitized and compiled from different sources by M. Zhivlov (March 2011).
Original Data: | http://starling.rinet.ru/cgi-bin/main.cgi?root=new100&encoding=utf-eng |
---|---|
Compiled by: | M. Zhivlov |
Aligned by: | J.-M. List |
Download: | ob-ugrian.zip |
Cite source as: | Zhivlov, M. (2011): Ob-Ugrian. In: The Global Lexicostatistical Database. Compiling, Clarifying, Connecting Basic Vocabulary Around The World: From Free-Form to Tree-Form. URL: http://starling.rinet.ru/new100/main.htm |
Romance
The data was taken from the Languages & Origins in Europe project at the McDonald Institute for Archaeological Research (2006-2009). The project website offers sound files and phonetic transcriptions of different European language families (Germanic, Romance, Slavic). For our benchmark database we used 30 cognate sets distributed over 8 language varieties in an IPA transcription that was slightly simplified compared to the original.
Original Data: | http://www.languagesandpeoples.com/ |
---|---|
Compiled by: | P. Heggarty |
Aligned by: | J.-M. List |
Download: | romance.zip |
Cite source as: | Heggarty, P. (2007): Languages and Origins in Europe. URL: http://www.languagesandpeoples.com/. |
Sinitic
The data in this subset on Chinese dialects was taken from the Xiàndài Hànyǔ Fāngyán Yīnkù (Phonological database of Chinese dialects, Hóu (2004)). The original data consists of sound recordings and phonetic transcriptions for 180 concepts translated into 40 Chinese dialect varieties. For the benchmark database, we selected only a small set of 20 cognate sets with slightly modified IPA transcriptions.
Original Data: | Hóu (2004) |
---|---|
Aligned by: | J.-M. List |
Download: | sinitic.zip |
Cite source as: | Hóu Jīngyī 侯精一 (2004): Xiàndài Hànyǔ fāngyán yīnkù 现代汉语方言音库 [Phonological database of Chinese dialects]. Shànghǎi: Shànghǎi Jiàoyù 上海教育. |
Slavic
This is but a small collection of 20 cognate sets distributed over 5 Slavic languages. The cognate sets were selected by consulting Derksen (2008). The phonetic transcriptions for the language varieties were taken from standard resources on the languages.
Original Data: | Derksen (2008) |
---|---|
Compiled by: | J.-M. List |
Aligned by: | J.-M. List |
Download: | slavic.zip |
Cite source as: | Derksen, R. (2008): Etymological dictionary of the Slavic inherited lexicon. Leiden and Boston: Brill. |