Frequently Asked Questions
- What are alignments?
- What formats do you use to display and share the alignments?
- How do I cite the BDPA?
- Which sources did you use?
- Whom should I contact if I have additional questions or want to contribute?
Alignment Analyses
Alignment analyses are the most common way to compare sequences. Since phonetic sequences are the basic comparanda in both historical linguistics and dialectology, alignment analyses play a crucial role in both disciplines. Without alignments, i.e., without the explicit matching of sounds, regular sound correspondences could not be detected, and neither cognacy between words nor genetic relationship between languages could be proven. However, although language comparison is always based on an implicit alignment of words, it is rarely visualized explicitly or termed as such, and in the rare cases where scholars do use alignments to visualize correspondence patterns in words, they merely serve illustrative purposes.
Basic Formats for Alignment Analyses
In order to exchange, edit, and compare phonetic alignments, different formats are used in the BDPA. We distinguish between formats for pairwise alignments and formats for multiple alignments. For practical reasons, the BDPA uses the alignment formats generally employed in LingPy. All formats are text-based and can be edited with the help of simple text editors.
The basic format for the representation of multiple alignment analyses is the MSA-format. Files in this format have the extension "msa". The first line of an MSA file serves as an identifier for the dataset from which the alignment was taken. There are no further format restrictions, and the user can freely decide what to use as an identifier, as long as it does not exceed the first line. In the BDPA, we use the names of our subsets as dataset identifiers. The second line is reserved as an identifier for the set of aligned sound sequences. This identifier can again be freely chosen by the user. In the BDPA, we generally use the meaning of the sound sequences as identifier, but we also add further information, such as the ancestral form (in language families) or the orthography of the corresponding word in the standard variety (in dialect datasets). The following lines give the phonetic sequences in aligned form, separated by a tab-stop, and preceded by language identifiers (ISO-code, language name, dialect location) in the first column of the alignment matrix. The hash symbol ("#") is used as a comment character. When placed at the beginning of a line, it indicates that the line should be ignored when parsing the file. Inspired by alignment formats in bioinformatics, LingPy allows for specific additional lines which can be used to annotate the alignments. Instances of metathesis, for example, may be represented by adding a line which starts with the keyword "SWAPS", with a plus character ("+") marking the beginning of a swapped region, the dash character ("-") its center, and another plus character the end. All sites which are not affected by swaps contain a dot ("."). In the BDPA, 66 out of 750 multiple alignments contain instances of metathesis and are regularly annotated in the way just described.
As an example, consider the file harry_potter.msa:
    Harry Potter Testset
    Woldemort (in different languages)
    English v o l - d e m o r t
    German. w a l - d e m a r -
    Russian v - l a d i m i r -
    SWAPS.. . + - + . . . . . .
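The layout described above is simple enough to read with a few lines of Python. The following is a minimal sketch, not LingPy's own reader: the names `MSA`, `split_row`, and `parse_msa` are hypothetical, and the sketch assumes that trailing dots in the first column (as in "German." and "SWAPS..") only pad the labels to equal width. It splits rows on tabs where present and falls back to any whitespace otherwise.

```python
from dataclasses import dataclass, field


@dataclass
class MSA:
    dataset: str                                   # line 1: dataset identifier
    seq_id: str                                    # line 2: identifier of the aligned set
    taxa: list = field(default_factory=list)       # language identifiers (first column)
    alignment: list = field(default_factory=list)  # one list of segments per language
    swaps: list = field(default_factory=list)      # (start, center, end) column indices


def split_row(line):
    """Split a row into its label and its cells (tabs preferred, else whitespace)."""
    parts = line.split("\t") if "\t" in line else line.split()
    # Assumption: trailing dots only pad the label to a fixed width.
    return parts[0].strip().rstrip("."), [p.strip() for p in parts[1:]]


def parse_msa(text):
    """Parse the content of an MSA file (hypothetical helper, not a LingPy function)."""
    # Lines starting with "#" are comments; blank lines carry no data.
    lines = [ln for ln in text.splitlines() if ln.strip() and not ln.startswith("#")]
    msa = MSA(dataset=lines[0].strip(), seq_id=lines[1].strip())
    for line in lines[2:]:
        label, cells = split_row(line)
        if label == "SWAPS":
            # A swapped region is marked by the column pattern "+", "-", "+".
            for i in range(len(cells) - 2):
                if cells[i:i + 3] == ["+", "-", "+"]:
                    msa.swaps.append((i, i + 1, i + 2))
        else:
            msa.taxa.append(label)
            msa.alignment.append(cells)
    return msa


EXAMPLE_MSA = """\
Harry Potter Testset
Woldemort (in different languages)
# comment lines like this one are skipped
English v o l - d e m o r t
German. w a l - d e m a r -
Russian v - l a d i m i r -
SWAPS.. . + - + . . . . . .
"""

msa = parse_msa(EXAMPLE_MSA)
print(msa.taxa, msa.swaps)
```

Storing each swap as a triple of column indices keeps the annotation independent of the row labels, so the same indices can be applied to every sequence in the alignment matrix.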
In principle, the MSA-format can also be used to represent pairwise alignment analyses. However, since each MSA-file is a single text-file, we would need 7 197 different text-files to represent all sequence pairs of our master benchmark for pairwise alignment analyses. Using such a large number of text-files to represent the rather small amount of information available in pairwise alignments is not only impractical for a shared digital resource, but also very inefficient for computation.
In order to handle large numbers of pairwise alignments in one and the same text-file, LingPy offers an additional format for pairwise alignment analyses. This format is called PSA-format, and files in this format have the extension "psa". As in the MSA-format, the first line of a PSA-file is reserved for an identifier that refers to the dataset from which the data was taken. The sequence pairs themselves are given in triplets: the first line of a triplet holds a sequence identifier (containing the meaning or orthographical information), and the second and third lines hold the two aligned sequences, which form the alignment matrix, with the language identifiers placed in the first column. All triplets (sequence pair identifier and two sequences) are separated by one empty line. As an example, consider the file harry_potter.psa:
    Harry Potter Testset
    Woldemort in German and Russian
    German. w a l - d e m a r
    Russian v - l a d i m i r

    Woldemort in English and Russian
    English w o l - d e m o r t
    Russian v - l a d i m i r -

    Woldemort in English and German
    English w o l d e m o r t
    German. w a l d e m a r -
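The triplet structure of a PSA file can be read in much the same way. The sketch below is illustrative only: `split_row` and `parse_psa` are hypothetical names rather than LingPy functions, and trailing dots in the language column are again assumed to be padding. Triplets are collected line by line and flushed whenever an empty line (or the end of the file) is reached.

```python
def split_row(line):
    """Split a row into its label and its cells (tabs preferred, else whitespace)."""
    parts = line.split("\t") if "\t" in line else line.split()
    # Assumption: trailing dots only pad the label to a fixed width.
    return parts[0].strip().rstrip("."), [p.strip() for p in parts[1:]]


def parse_psa(text):
    """Return the dataset identifier and a list of aligned sequence pairs."""
    lines = text.splitlines()
    dataset = lines[0].strip()
    pairs, block = [], []
    # The trailing empty string is a sentinel that flushes the last triplet.
    for line in lines[1:] + [""]:
        if line.strip():
            block.append(line)
            continue
        if block:
            if len(block) != 3:
                raise ValueError(f"expected 3 lines per triplet, got {len(block)}")
            (tax_a, seq_a), (tax_b, seq_b) = split_row(block[1]), split_row(block[2])
            pairs.append({
                "id": block[0].strip(),         # meaning or orthographical information
                "taxa": (tax_a, tax_b),         # language identifiers
                "alignment": (seq_a, seq_b),    # the two aligned sequences
            })
            block = []
    return dataset, pairs


EXAMPLE_PSA = """\
Harry Potter Testset
Woldemort in German and Russian
German. w a l - d e m a r
Russian v - l a d i m i r

Woldemort in English and Russian
English w o l - d e m o r t
Russian v - l a d i m i r -

Woldemort in English and German
English w o l d e m o r t
German. w a l d e m a r -
"""

dataset, pairs = parse_psa(EXAMPLE_PSA)
for pair in pairs:
    print(pair["id"], pair["taxa"])
```

Raising an error on malformed triplets, instead of silently skipping them, makes a missing separator line visible early when a large file is edited by hand.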
In the BDPA, the pairwise benchmarks, as described above, are provided in PSA-format. Additionally, we extracted all possible pairwise alignments inherent in our master set of 750 multiple alignments and offer them for download in PSA-format. You can download both MSA and PSA files for each subset from here.
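How the pairwise alignments were derived from the multiple alignments is not spelled out here. One plausible procedure, sketched below under that assumption, is to take every pair of rows from an alignment matrix and drop the columns in which both sequences have a gap. The function name `pairwise_from_msa` is hypothetical, and the input reuses the Harry Potter example from above.

```python
from itertools import combinations


def pairwise_from_msa(taxa, alignment, gap="-"):
    """Project a multiple alignment onto all pairs of its sequences."""
    pairs = {}
    for i, j in combinations(range(len(taxa)), 2):
        # Keep a column unless both sequences have a gap in it.
        columns = [
            (a, b)
            for a, b in zip(alignment[i], alignment[j])
            if not (a == gap and b == gap)
        ]
        pairs[(taxa[i], taxa[j])] = (
            [a for a, _ in columns],
            [b for _, b in columns],
        )
    return pairs


taxa = ["English", "German", "Russian"]
alignment = [
    "v o l - d e m o r t".split(),
    "w a l - d e m a r -".split(),
    "v - l a d i m i r -".split(),
]

projected = pairwise_from_msa(taxa, alignment)
for (tax_a, tax_b), (seq_a, seq_b) in projected.items():
    print(tax_a, " ".join(seq_a))
    print(tax_b, " ".join(seq_b))
    print()
```

For the German and Russian rows, for instance, the final column consists of gaps only and is removed, so both sequences end up nine segments long.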
Citing BDPA
If you use this database, please cite the following paper:
- List, Johann-Mattis and Jelena Prokić. (2014). A benchmark database of phonetic alignments in historical linguistics and dialectology. In: Proceedings of the International Conference on Language Resources and Evaluation (LREC), 26-31 May 2014, Reykjavik. 288-294.
The paper can be downloaded from this link. Please make sure that you also cite all individual sources of BDPA which you are using. For example, if you use the alignments of the Bai dialects in BDPA, you should cite both original sources from which they were taken, namely:
- Wang, F. (2006): Comparison of languages in contact. The distillation method and the case of Bai. Taipei: Institute of Linguistics, Academia Sinica.
- Allen, B. (2007): Bai dialect survey. SIL International. URL: http://www.sil.org/silesr/2007/silesr2007-012.pdf
Sources
All the sources we used to create the alignments can be found here.
Contact
For technical questions regarding the data, please contact Johann-Mattis List (Philipps-Universität Marburg) or Jelena Prokić (Philipps-Universität Marburg).