Farhan Samir

Farhan Samir will speak at the UCSD Linguistics Department colloquium on February 10th at 9:00 a.m. in AP&M 4301.

Charting Data Gaps in Multilingual Web Archives

At least as early as the mid-20th century, researchers like Warren Weaver imagined designing universal machines that could understand any language. They conceived of machines that would interpret linguistic utterances through a universal medium, not unlike UTF-8 byte sequences. However, this universal framing overlooks that languages vary widely in the sociocultural contexts they reflect. I will argue that overlooking these differences has led to emergent Anglocentric biases in practices for collecting the datasets that fuel our current-day consumer language technologies.

In making my argument, I will first describe the InfoGap method for identifying data gaps in a widely used English corpus, namely English Wikipedia. Applying InfoGap in a large-scale comparative analysis, I will show that English Wikipedia contains significant data gaps relative to the Russian and French language editions, and that these gaps are especially prominent for regional topics outside the Anglosphere. Second, through a case study of a multilingual transcribed-speech dataset, I will argue that algorithmic scraping and filtering of massively multilingual datasets carries a significant risk of misrepresenting already marginalized language varieties and speaker communities. Finally, I will discuss future research directions that steer away from the universal vision for speech and language technologies, which has inadvertently centered English as the standard of comparison, and instead gear my discussion towards co-designing tasks, archives, and methods that center marginalized language varieties.