Introduction to the BDPA

With the "Benchmark Database for Phonetic Alignments" (BDPA), we present a new data resource for historical linguistics and dialectology that offers collections of cognate words from different language varieties. In contrast to other resources, which concentrate on questions of cognacy and lexical change, the BDPA represents the data in the form of pairwise and multiple alignments. An alignment is a matrix representation of two or more sequences in which corresponding segments are placed in the same column, and empty cells resulting from non-matching segments are filled with gap symbols. Currently, the BDPA offers a total of 750 multiple alignments, covering eight language families, more than 500 different language varieties, and more than 50 000 words.
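
As a minimal illustration of the matrix view described above, the following Python sketch stores a pairwise alignment as a list of rows. The word pair (German "Tochter", English "daughter"), the simplified transcriptions, the use of "-" as gap symbol, and all function names are assumptions made for this example; they are not taken from the BDPA data or its file format.

```python
# Illustrative sketch only: the data and names below are hypothetical,
# not part of the BDPA or its file format.
GAP = "-"  # assumed gap symbol

# One row per word, one column per set of corresponding segments.
alignment = [
    ["t", "ɔ", "x", "t", "ə", "r"],   # German "Tochter" (simplified)
    ["d", "ɔː", GAP, "t", "ə", "r"],  # English "daughter" (simplified)
]

def is_rectangular(rows):
    """A valid alignment matrix has rows of equal length."""
    return len({len(row) for row in rows}) == 1

def columns(rows):
    """Return the alignment column by column."""
    return [tuple(col) for col in zip(*rows)]

def ungap(row):
    """Recover the original sequence by dropping gap symbols."""
    return [seg for seg in row if seg != GAP]

if __name__ == "__main__":
    print(is_rectangular(alignment))
    for col in columns(alignment):
        print(" ".join(col))
    print(ungap(alignment[1]))
```

The third column pairs the German segment "x" with a gap, which reflects that it has no counterpart in the English word in this toy example.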

In the last two decades, automatic alignment analyses of phonetic strings have become an important tool in quantitative language comparison. Phonetic alignment plays a crucial role in the identification of regular sound correspondences and deeper genealogical relations between and within language families. Surprisingly, to date there are no easily accessible benchmarks for phonetic alignment analyses. Here we present a publicly available benchmark database of manually edited phonetic alignments, designed as a platform for testing the performance of automatic alignment algorithms. The database consists of a great variety of alignments drawn from a number of different sources. The data is arranged in such a way that typical problems encountered in phonetic alignment analyses (metathesis, splits and mergers of sounds, diversity of phonetic strings) are represented and can be tested directly.
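
To make the idea of testing an algorithm against a manually edited alignment concrete, the sketch below computes a simple column score: the share of columns in the reference alignment that an automatic alignment reproduces exactly. This is one common evaluation measure and is shown here as an assumption; the text above does not prescribe a particular metric. The example data, the "-" gap symbol, and the function names are likewise hypothetical.

```python
# Illustrative sketch only: metric choice, data, and names are assumptions.
GAP = "-"  # assumed gap symbol

def indexed_columns(rows):
    """Describe each column by the positions of its segments in the
    ungapped sequences (None for a gap), so that columns from two
    alignments of the same words can be compared."""
    counters = [0] * len(rows)
    cols = set()
    for col in zip(*rows):
        cell_ids = []
        for i, seg in enumerate(col):
            if seg == GAP:
                cell_ids.append(None)
            else:
                cell_ids.append(counters[i])
                counters[i] += 1
        cols.add(tuple(cell_ids))
    return cols

def column_score(test, gold):
    """Fraction of reference (gold) columns found in the test alignment."""
    gold_cols = indexed_columns(gold)
    return len(indexed_columns(test) & gold_cols) / len(gold_cols)

# Hypothetical reference alignment and a hypothetical automatic result
# that places the gap one column too late.
gold = [
    ["t", "ɔ", "x", "t", "ə", "r"],
    ["d", "ɔː", GAP, "t", "ə", "r"],
]
test = [
    ["t", "ɔ", "x", "t", "ə", "r"],
    ["d", "ɔː", "t", GAP, "ə", "r"],
]

if __name__ == "__main__":
    print(column_score(gold, gold))
    print(column_score(test, gold))
```

In this toy case the misplaced gap changes two of the six reference columns, so four of six columns match.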

Release: 1.0
Date: June 5, 2014