Coralie Cram and I have been trying to figure out how many tokens constitute a safe minimum for the corpora we are investigating. I have an ongoing project on the comparative phonetics of Australian languages, and one issue we continually come across is how much data we need: whether we should include corpora with relatively small amounts of data, or stick to the largest corpora. Of course, we understand that the answer is likely to be “it depends” – it depends on the clarity of the recordings and the number of speakers, for example. But we should still be able to come up with some general guidelines: is 100 tokens enough, for example? Or does it need to be more like 500? This poster has some preliminary findings, using data from my Bardi corpus. Here’s a link to the full poster.
In the poster, we take two datasets for Bardi: the wordlist data that was used for the 2012 Bardi JIPA sketch (DOI: https://doi.org/10.1017/S0025100312000217) and a set of narrative recordings. The narratives are much more like the general-purpose field collections that we are working with from archives, while the wordlist data is a smaller set of higher-quality recordings with more careful speech.
Methods are on the poster, but in brief they involve taking different subsets of the full dataset and using the Kolmogorov-Smirnov test to evaluate differences in sample means and variance. We looked at short vowels and made no attempt to remove mistracked formants or outliers.
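The core comparison can be sketched roughly as follows. This is a minimal illustration in Python with synthetic data, not the poster's actual pipeline (the analysis there was done in R); the formant values below are made up, and scipy's `ks_2samp` stands in for the KS test used in the poster.

```python
# Sketch of the subsampling idea: draw a random subset of the full dataset
# and ask whether a two-sample Kolmogorov-Smirnov test can distinguish it
# from the full dataset. Data is synthetic, purely for illustration.
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)
# Stand-in for F1 measurements (Hz) of one short vowel; hypothetical values.
full_f1 = rng.normal(loc=500, scale=60, size=800)

def subsample_ks(full, proportion, rng):
    """Compare a random subset of `full` to the full dataset with a
    two-sample KS test; returns the KS statistic and p-value."""
    n = int(len(full) * proportion)
    subset = rng.choice(full, size=n, replace=False)
    result = ks_2samp(subset, full)
    return result.statistic, result.pvalue

stat, p = subsample_ks(full_f1, 0.5, rng)
```

A high p-value here means the test cannot distinguish the subsample's distribution from the full dataset's, which is the sense of "replicating the sample characteristics" used below.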
We found that for the wordlist data, we needed more than about 50% of the data (so, about 400 tokens across 4 vowels) to replicate the sample characteristics of the full dataset. For the narrative data, it was about 30% of the data (more like 2300 tokens), but the variance was much higher.
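One way to arrive at a threshold like "more than about 50% of the data" is to sweep over subsample proportions and record how often subsamples of each size pass the KS comparison. The sketch below does this on synthetic data; the proportions, sample size, and alpha level are illustrative assumptions, not the poster's actual settings.

```python
# For each subsample proportion, repeatedly draw random subsets and record
# the fraction of draws in which a two-sample KS test does NOT distinguish
# the subset from the full dataset (p >= alpha). Synthetic data throughout.
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(1)
full = rng.normal(loc=500, scale=60, size=800)  # stand-in formant values (Hz)

def agreement_rate(full, proportion, n_draws=200, alpha=0.05, rng=rng):
    """Fraction of random subsamples whose distribution is not
    significantly different from the full dataset at level alpha."""
    n = int(len(full) * proportion)
    hits = 0
    for _ in range(n_draws):
        subset = rng.choice(full, size=n, replace=False)
        if ks_2samp(subset, full).pvalue >= alpha:
            hits += 1
    return hits / n_draws

rates = {prop: agreement_rate(full, prop) for prop in (0.1, 0.3, 0.5, 0.7)}
```

The smallest proportion whose agreement rate stays near 1 across draws is a candidate for the minimum sample size; higher-variance data (like the narrative recordings) would need a larger proportion to stabilize.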