BDPA

The Sources of the BDPA

The table below gives an overview of all the sources in the BDPA. Click on the links in the table to view a more detailed description of each source. Each detailed description includes information on the original source from which the data was taken, a link to the original data source, the name(s) of the person(s) who compiled and aligned the data, and a link from which the manually edited pairwise and multiple sequence alignments for that source can be downloaded (along with the unaligned input files).

Dataset	Languages	PSA (number of pairwise alignments)	MSA (number of multiple alignments)	NoW (number of words)	NoL (number of languages)	PID (average percentage identity)
Andean	Andean languages (Aymara, Quechua)	619	76	883	20	55
Bai	Bai dialects	889	90	1416	17	32
Bulgarian	Bulgarian dialects	1595	152	32418	197	48
Dutch	Dutch dialects	500	50	3024	62	44
Dutch_RND	Dutch dialects	1651	166	50046	363	45
French	French dialects	712	76	3810	62	41
Germanic	Germanic languages and dialects	1110	111	4775	45	32
Japanese	Japanese dialects	219	26	224	10	40
Norwegian	Norwegian dialects	501	51	2183	51	46
Ob-Ugrian	Uralic languages	444	48	689	21	45
Romance	Romance languages	297	30	240	8	37
Sinitic	Chinese dialects	200	20	20	40	35
Slavic	Slavic languages	120	20	100	5	38
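
The PID column reports an average percentage identity per dataset. This page does not state how the value is computed, so the following is only a minimal sketch under one common assumption: the percentage identity of a pairwise alignment is the share of alignment columns in which both sequences carry the same segment, with gapped columns counted as mismatches. The function name `percentage_identity`, the gap symbol `-`, and the example word pair are hypothetical and are not taken from the database.

```python
def percentage_identity(seq_a, seq_b, gap="-"):
    """Percentage of alignment columns whose two segments are identical.

    Assumed definition (not confirmed by the BDPA page): identical,
    non-gap columns divided by the total alignment length, times 100.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have equal length")
    if not seq_a:
        return 0.0
    # Count columns where both rows hold the same segment and it is not a gap.
    matches = sum(1 for a, b in zip(seq_a, seq_b) if a == b and a != gap)
    return 100.0 * matches / len(seq_a)


# Hypothetical aligned word pair, one phonetic segment per column.
word_a = ["t", "ɔ", "x", "t", "ə", "r"]
word_b = ["d", "ɔ", "-", "t", "ə", "r"]
print(percentage_identity(word_a, word_b))
```

Under this assumption, a dataset-level figure such as those in the table would be the mean of this value over all pairwise alignments in the dataset. Other definitions (for example, dividing by the length of the shorter unaligned sequence) give different numbers.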

Andean

The data comes from the Sounds of the Andean Languages project (2001-2004) at the University of Sheffield, UK. The main contributor was Paul Heggarty, who conducted the field work and provided the sound recordings and phonetic transcriptions, which are available on the project website. For our benchmark database, we used 76 cognate sets distributed over 21 varieties of Quechua and Aymara.

Original Data:	http://www.quechua.org.uk/
Compiled by:	Paul Heggarty
Aligned by:	J.-M. List
Download:	andean.zip
Cite as:	Heggarty, P. (2006): Sounds of the Andean languages. URL: http://www.quechua.org.uk.

Bai

The data on the Bai dialects is compiled from two independently conducted studies, Wang (2006) and Allen (2007). From these sources, 90 cognate sets distributed over 17 language varieties were extracted.

Original Data:	Wang (2006), Allen (2007)
Aligned by:	J.-M. List
Download:	bai.zip
Cite source as:	Wang, F. (2006): Comparison of languages in contact. The distillation method and the case of Bai. Taipei: Institute of Academia Sinica. Allen, B. (2007): Bai dialect survey. SIL International. URL: http://www.sil.org/silesr/2007/silesr2007-012.pdf.

Bulgarian

Bulgarian dialect data comes from the Buldialect project (Buldialect - Measuring Linguistic Unity and Diversity in Europe, 2006-2010). This was a joint project between the Eberhard-Karls University Tübingen, the University of Groningen and the Bulgarian Academy of Sciences. The Bulgarian data in the BDPA contains the phonetic transcriptions of 152 words collected at 197 sites distributed all over Bulgaria. The data was collected in such a way that it represents the most important phonetic features described in the traditional literature on Bulgarian dialects.

Original Data:	http://www.jelenaprokic.eu/buldialect
Aligned by:	J. Prokić
Download:	bulgarian.zip
Cite as:	Prokić, J.; Nerbonne, J.; Zhobov, V.; Osenova, P.; Simov, K.; Zastrow, T. and E. Hinrichs (2009): "The computational analysis of Bulgarian dialect pronunciation". Serdica Journal of Computing 3.3:269-298.

Dutch

Dutch dialect data comes from the Goeman-Taeldeman-Van Reenen project (GTRP, 1980-1995) at the Meertens Institute in the Netherlands. Our benchmark data contains aligned transcriptions of 50 words from the GTRP collected at 62 places.

Original Data:	http://www.meertens.knaw.nl/mand/database/
Aligned by:	J.-M. List
Download:	dutch.zip
Cite as:	de Schutter, G., B. van den Berg, T. Goeman, and T. de Jong (2007): MAND. Morfologische Atlas van de Nederlandse Dialecten. URL: http://www.meertens.knaw.nl/mand/database/

French

The data on French dialects comprises a small excerpt of 76 cognate sets distributed over 60 dialect varieties taken from the "Tableaux phonétiques des patois suisses romands". The data was partially digitized at the Heinrich Heine University Düsseldorf under the supervision of Hans Geisler. Our collection is based on a simplified conversion of the original phonetic orthography into IPA.

Original Data:	Gauchat et al. (1925)
Digitized by:	H. Geisler
Aligned by:	J.-M. List
Download:	french.zip
Cite source as:	Gauchat, L. et al. (1925): Tableaux phonétiques des patois suisses romands. Neuchâtel: Attinger.

Germanic

The data was taken from the Languages & Origins in Europe project at the McDonald Institute for Archaeological Research (2006-2009). The project website offers sound files and phonetic transcriptions of different European language families (Germanic, Romance, Slavic). For our benchmark database we used 111 cognate sets distributed over 45 language varieties (English dialects and Germanic languages) in an IPA transcription that was slightly simplified compared to the original.

Original Data:	http://www.languagesandpeoples.com/
Compiled by:	P. Heggarty
Aligned by:	J.-M. List
Download:	germanic.zip
Cite source as:	Heggarty, P. (2007): Languages and Origins in Europe. URL: http://www.languagesandpeoples.com/.

Japanese

The data was taken from Shirō (1973) and digitized by three students from Heinrich Heine University Düsseldorf during a course on historical linguistics. The original phonetic transcriptions are given here in a slightly modified form that conforms to the IPA standard.

Original Data:	Shiro (1973)
Digitized by:	M. Dickmanns, S. M. Oetzel, and K. Vogt
Aligned by:	J.-M. List
Download:	japanese.zip
Cite source as:	Shiro, H. (1973): Japanese dialects. In: Diachronic, areal and typological linguistics. Edited by H. M. Hoenigswald and R. E. Longacre. 368-400.

Norwegian

Norwegian dialect data comes from a project conducted at the Department of Linguistics of the Norwegian University of Science and Technology (NTNU) in Trondheim, in which dialect speakers were asked to read the fable "The North Wind and the Sun" in their native dialect. All recordings were phonetically transcribed and can be found at http://www.ling.hf.ntnu.no/nos/. Our benchmark database contains transcriptions of 51 manually extracted cognates recorded from 51 speakers.

Original Data:	http://www.ling.hf.ntnu.no/nos/
Aligned by:	J.-M. List
Download:	norwegian.zip
Cite source as:	Almberg, J. and Skarbø, K. (2011): Nordavinden og sola. En norsk dialektprøvedatabase på nettet. URL: http://www.ling.hf.ntnu.no/nos/

Ob-Ugrian

The data in this subset of the benchmark database comes from the Global Lexicostatistical Database. It consists of 48 cognate sets distributed over 21 Ob-Ugrian (Uralic) languages. The data was digitized and compiled from different sources by M. Zhivlov (March 2011).

Original Data:	http://starling.rinet.ru/cgi-bin/main.cgi?root=new100&encoding=utf-eng
Compiled by:	M. Zhivlov
Aligned by:	J.-M. List
Download:	ob-ugrian.zip
Cite source as:	Zhivlov, M. (2011): Ob-Ugrian. In: The Global Lexicostatistical Database. Compiling, Clarifying, Connecting Basic Vocabulary Around The World: From Free-Form to Tree-Form. URL: http://starling.rinet.ru/new100/main.htm

Romance

The data was taken from the Languages & Origins in Europe project at the McDonald Institute for Archaeological Research (2006-2009). The project website offers sound files and phonetic transcriptions of different European language families (Germanic, Romance, Slavic). For our benchmark database we used 30 cognate sets distributed over 8 language varieties in an IPA transcription that was slightly simplified compared to the original.

Original Data:	http://www.languagesandpeoples.com/
Compiled by:	P. Heggarty
Aligned by:	J.-M. List
Download:	romance.zip
Cite source as:	Heggarty, P. (2007): Languages and Origins in Europe. URL: http://www.languagesandpeoples.com/.

Sinitic

The data in this subset on Chinese dialects was taken from the Xiàndài Hànyǔ Fāngyán Yīnkù (Phonetic database of Chinese dialects, Hóu (2004)). The original data consists of sound recordings and phonetic transcriptions for 180 concepts translated into 40 Chinese dialect varieties. For the benchmark database, we took only a small sample of 20 cognate sets with slightly modified IPA transcriptions.

Original Data:	Hóu (2004)
Aligned by:	J.-M. List
Download:	sinitic.zip
Cite source as:	Hóu Jīngyī 侯精一 (2004): Xiàndài Hànyǔ fāngyán yīnkù 现代汉语方言音库 [Phonological database of Chinese dialects]. Shànghǎi: Shànghǎi Jiàoyù 上海教育.

Slavic

This is but a small collection of 20 cognate sets distributed over 5 Slavic languages. The cognate sets were selected by consulting Derksen (2008). The phonetic transcriptions for the language varieties were taken from standard resources on the languages.

Original Data:	Derksen (2008)
Compiled by:	J.-M. List
Aligned by:	J.-M. List
Download:	slavic.zip
Cite source as:	Derksen, R. (2008): Etymological dictionary of the Slavic inherited lexicon. Leiden and Boston: Brill.