The Sources of the BDPA


The table below gives an overview of all the sources in the BDHL database. Click on the links in the table to view a more detailed description of each source. Each detailed description names the original source from which the data was taken, links to the original data, names the person(s) who compiled and aligned the data, and provides a link for downloading the manually edited pairwise and multiple sequence alignments for that source (along with the unaligned input files).

Dataset    Languages
Andean     Andean languages (Aymara, Quechua)   619   76    883   20  55
Bai        Bai dialects                         889   90   1416   17  32
Bulgarian  Bulgarian dialects                  1515  152  32418  197  48
Dutch      Dutch dialects                       500   50   3024   62  44
French     French dialects                      712   76   3810   62  41
Germanic   Germanic languages and dialects     1110  111   4775   45  32
Japanese   Japanese dialects                    219   26    224   10  40
Norwegian  Norwegian dialects                   501   51   2183   51  46
Ob-Ugrian  Uralic languages                     444   48    689   21  45
Romance    Romance languages                    297   30    240    8  37
Sinitic    Chinese dialects                     200   20     20   40  35
Slavic     Slavic languages                     120   20    100    5  38


Andean


The data comes from the Sounds of the Andean Languages project (2001-2004) at the University of Sheffield, UK. The main contributor was Paul Heggarty, who conducted the fieldwork and provided the sound recordings and phonetic transcriptions, which are available on the project website. For our benchmark database, we used 76 cognate sets distributed over 21 varieties of Quechua and Aymara.

Original Data: http://www.quechua.org.uk/
Compiled by: Paul Heggarty
Aligned by: J.-M. List
Download: andean.zip
Cite as: Heggarty, P. (2006): Sounds of the Andean languages. URL: http://www.quechua.org.uk.

Bai


The data on the Bai dialects is a compilation of two independently conducted studies by Wang (2006) and Allen (2007). From these sources, 90 cognate sets distributed over 17 language varieties were extracted.

Original Data: Wang (2006), Allen (2007)
Aligned by: J.-M. List
Download: bai.zip
Cite source as: Wang, F. (2006): Comparison of languages in contact. The distillation method and the case of Bai. Taipei: Institute of Academia Sinica.
Allen, B. (2007): Bai dialect survey. SIL International. URL: http://www.sil.org/silesr/2007/silesr2007-012.pdf.

Bulgarian


The Bulgarian dialect data comes from the Buldialect project (Buldialect - Measuring Linguistic Unity and Diversity in Europe, 2006-2010). This was a joint project between the Eberhard-Karls University Tübingen, the University of Groningen and the Bulgarian Academy of Sciences. The Bulgarian data in the BDHL contains the phonetic transcriptions of 152 words collected at 197 sites distributed all over Bulgaria. The data was collected in such a way that it represents the most important phonetic features described in the traditional literature on Bulgarian dialects.

Original Data: http://www.jelenaprokic.eu/buldialect
Aligned by: J. Prokić
Download: bulgarian.zip
Cite as: Prokić, J.; Nerbonne, J.; Zhobov, V.; Osenova, P.; Simov, K.; Zastrow, T. and E. Hinrichs (2009): "The computational analysis of Bulgarian dialect pronunciation". Serdica Journal of Computing 3.3:269-298.

Dutch


The Dutch dialect data comes from the Goeman-Taeldeman-Van Reenen project (GTRP, 1980-1995) at the Meertens Institute in the Netherlands. Our benchmark data contains aligned transcriptions of 50 words from the GTRP collected at 62 places.

Original Data: http://www.meertens.knaw.nl/mand/database/
Aligned by: J.-M. List
Download: dutch.zip
Cite as: de Schutter, G., B. van den Berg, T. Goeman, and T. de Jong (2007): MAND. Morfologische Atlas van de Nederlandse Dialecten. URL: http://www.meertens.knaw.nl/mand/database/

French


The data on French dialects comprises a small excerpt of 76 cognate sets distributed over 60 dialect varieties taken from the "Tableaux phonétiques des patois suisses romands". The data was partially digitized at the Heinrich Heine University Düsseldorf under the supervision of Hans Geisler. Our collection is based on a simplified conversion of the original phonetic orthography into IPA.

Original Data: Gauchat et al. (1925)
Digitized by: H. Geisler
Aligned by: J.-M. List
Download: french.zip
Cite source as: Gauchat, L. et al. (1925): Tableaux phonétiques des patois suisses romands. Neuchâtel: Attinger.

Germanic


The data was taken from the Languages & Origins in Europe project at the McDonald Institute for Archaeological Research (2006-2009). The project website offers sound files and phonetic transcriptions of different European language families (Germanic, Romance, Slavic). For our benchmark database we used 111 cognate sets distributed over 45 language varieties (English dialects and Germanic languages) in an IPA transcription that was slightly simplified compared to the original.

Original Data: http://www.languagesandpeoples.com/
Compiled by: P. Heggarty
Aligned by: J.-M. List
Download: germanic.zip
Cite source as: Heggarty, P. (2007): Languages and Origins in Europe. URL: http://www.languagesandpeoples.com/.

Japanese


The data was taken from Shirō (1973). It was digitized by three students from Heinrich Heine University Düsseldorf during a course on historical linguistics. Here, the original phonetic transcriptions are given in a slightly modified form that conforms to the IPA standard.

Original Data: Shiro (1973)
Digitized by: M. Dickmanns, S. M. Oetzel, and K. Vogt
Aligned by: J.-M. List
Download: japanese.zip
Cite source as: Shiro, H. (1973): Japanese dialects. In: Diachronic, areal and typological linguistics. Edited by H. M. Hoenigswald and R. H. Langacre. 368-400.

Norwegian


The Norwegian dialect data comes from a project conducted at the Department of Linguistics of the Norwegian University of Science and Technology (NTNU) in Trondheim, in which dialect speakers were asked to read the fable "The North Wind and the Sun" in their native dialect. All recordings were phonetically transcribed and can be found at http://www.ling.hf.ntnu.no/nos/. Our benchmark database contains transcriptions of 51 manually extracted cognates recorded from 51 speakers.

Original Data: http://www.ling.hf.ntnu.no/nos/
Aligned by: J.-M. List
Download: norwegian.zip
Cite source as: Almberg, J. and Skarbø, K. (2011): Nordavinden og sola. En norsk dialektprøvedatabase på nettet. URL: http://www.ling.hf.ntnu.no/nos/

Ob-Ugrian


The data in this subset of the benchmark database comes from the Global Lexicostatistical Database. It consists of 48 cognate sets distributed over 21 Ob-Ugrian (Uralic) languages. The data was digitized and compiled from different sources by M. Zhivlov (March 2011).

Original Data: http://starling.rinet.ru/cgi-bin/main.cgi?root=new100&encoding=utf-eng
Compiled by: M. Zhivlov
Aligned by: J.-M. List
Download: ob-ugrian.zip
Cite source as: Zhivlov, M. (2011): Ob-Ugrian. In: The Global Lexicostatistical Database. Compiling, Clarifying, Connecting Basic Vocabulary Around The World: From Free-Form to Tree-Form. URL: http://starling.rinet.ru/new100/main.htm

Romance


The data was taken from the Languages & Origins in Europe project at the McDonald Institute for Archaeological Research (2006-2009). The project website offers sound files and phonetic transcriptions of different European language families (Germanic, Romance, Slavic). For our benchmark database we used 30 cognate sets distributed over 8 language varieties in an IPA transcription that was slightly simplified compared to the original.

Original Data: http://www.languagesandpeoples.com/
Compiled by: P. Heggarty
Aligned by: J.-M. List
Download: romance.zip
Cite source as: Heggarty, P. (2007): Languages and Origins in Europe. URL: http://www.languagesandpeoples.com/.

Sinitic


The data in this subset on Chinese dialects was taken from the Xiàndài Hànyǔ Fāngyán Yīnkù (Phonetic database of Chinese dialects, Hóu (2004)). The original data consists of sound recordings and phonetic transcriptions for 180 concepts translated into 40 Chinese dialect varieties. For the benchmark database, we took only a small selection of 20 cognate sets with slightly modified IPA transcriptions.

Original Data: Hóu (2004)
Aligned by: J.-M. List
Download: sinitic.zip
Cite source as: Hóu Jīngyī 侯精一 (2004): Xiàndài Hànyǔ fāngyán yīnkù 现代汉语方言音库 [Phonological database of Chinese dialects]. Shànghǎi: Shànghǎi Jiàoyù 上海教育.

Slavic


This is only a small collection of 20 cognate sets distributed over 5 Slavic languages. The cognate sets were selected by consulting Derksen (2008). The phonetic transcriptions for the language varieties were taken from standard resources on the languages.

Original Data: Derksen (2008)
Compiled by: J.-M. List
Aligned by: J.-M. List
Download: slavic.zip
Cite source as: Derksen, R. (2008): Etymological dictionary of the Slavic inherited lexicon. Leiden and Boston: Brill.