Abstract
Massive automatic comparison of languages in parallel corpora will greatly speed up and enhance comparative syntactic research. Automatically extracting and mining syntactic differences from parallel corpora requires a pre-processing step that filters out sentence pairs that cannot be compared syntactically, for example because they involve "free" translations. In this paper we explore four possible filters: the Damerau-Levenshtein distance between POS tags, the sentence-length ratio, the graph-edit distance between dependency parses, and a combination of the three in a logistic regression model. Results suggest that the dependency-parse filter is the most stable across language pairs, while the combination filter achieves the best results.
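To make the first two filters concrete, the following is a minimal sketch, not the authors' implementation. It assumes the restricted (optimal string alignment) variant of the Damerau-Levenshtein distance, applied to whole POS tags rather than characters. The function names, the normalisation by the longer sequence, and both thresholds are hypothetical. The simple conjunction in `keep_pair` is only a stand-in for illustration; the paper combines its filters in a logistic regression model.

```python
from typing import Sequence


def osa_distance(a: Sequence[str], b: Sequence[str]) -> int:
    """Restricted Damerau-Levenshtein (optimal string alignment) distance.

    Operates on sequences of POS tags: inserting, deleting, or substituting
    a tag, or swapping two adjacent tags, each costs 1.
    """
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete all tags of a[:i]
    for j in range(cols):
        d[0][j] = j  # insert all tags of b[:j]
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # match or substitution
            )
            # Transposition of two adjacent tags.
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]


def length_ratio(a: Sequence[str], b: Sequence[str]) -> float:
    """Length of the shorter sequence divided by the longer one, in [0, 1]."""
    if not a and not b:
        return 1.0
    return min(len(a), len(b)) / max(len(a), len(b))


def keep_pair(
    src_tags: Sequence[str],
    tgt_tags: Sequence[str],
    max_norm_dist: float = 0.5,  # hypothetical threshold
    min_ratio: float = 0.7,      # hypothetical threshold
) -> bool:
    """Keep a sentence pair only if both illustrative filters accept it."""
    longest = max(len(src_tags), len(tgt_tags), 1)
    norm_dist = osa_distance(src_tags, tgt_tags) / longest
    return norm_dist <= max_norm_dist and length_ratio(src_tags, tgt_tags) >= min_ratio
```

For example, `osa_distance(["DET", "NOUN", "VERB"], ["DET", "VERB", "NOUN"])` counts the swapped final tags as a single transposition, so a pair that differs only in word order stays close under this measure.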
Original language | English |
---|---|
Pages (from-to) | 147-161 |
Number of pages | 15 |
Journal | Linguistics in the Netherlands |
Volume | 36 |
Issue number | 1 |
Publication status | Published - 5 Nov 2019 |
Externally published | Yes |