Linguistic Corpora Collections
Purchased from
- Global Web-Based English (GloWbE) and Corpus of Contemporary American English (COCA) - available ON-CAMPUS only
Corpora from the Linguistic Data Consortium (LDC)
New for 2020
- OntoNotes 5.0 - or contact Janice Adlington directly
- Treebank-3
- 2005 NIST/USF Evaluation Resources for the VACE Program - Broadcast News
- 2009 CoNLL Shared Task Part 1 - Spanish+Catalan, Czech, German
- 2009 CoNLL Shared Task Part 2 - Chinese and English trial corpora
- Arabic Gigaword, 5th edition - Data
- CoNLL-2009 Shared Task data for Spanish, Catalan, Czech and German - Data
- Chinese Dependency Treebank 1.0 - Data
- Digital Archive of Southern Speech - Data
- English Gigaword, 5th edition, Disc 1 - Data
- English Gigaword, 5th edition, Disc 2 - Data
- English Gigaword, 5th edition, Disc 3 - Data
- Annotated English Gigaword, 5th edition - Data
- English Web Treebank - Data
- Fisher Spanish, Speech - Disk 1, Disk 2
- Fisher Spanish, Transcripts - Data
- French Gigaword, 3rd edition - Data
- Prague Czech-English Dependency Treebank 2.0
- Russian-English Computer Security Parallel Text - Data
- Spanish Gigaword, 3rd edition - Data
- TORGO Database of Dysarthric Articulation - Disk 1, Disk 2, Disk 3, Disk 4
- WTIMIT: The TIMIT Speech Corpus - Data
USC Shoah Foundation Institute - data use license
McMaster also subscribes to the Visual History Archive of the Shoah Foundation Institute, which provides online access to the video testimonies.