Before you can undertake automated text analysis, you need to obtain a corpus of digitized texts and, in many cases, prepare them for further processing. This hands-on digital humanities workshop focuses on the technical dimensions of corpus development. We will explore:
- the risks and benefits of optical character recognition (OCR)
- file formatting and naming issues
- organization strategies for large corpora
- problems of data cleaning and preparation
- common sources for textual research data
- common legal concerns around the use of textual corpora
This workshop is open to all graduate students and is offered for Responsible Conduct of Research (RCR) credit as GS717.08. Priority registration will be given to students who intend to receive RCR credit.
Zoom information will be emailed to participants in advance of the session. For more information, contact Will Shaw, Digital Humanities Consultant, at firstname.lastname@example.org.
Location: Online via Zoom.
- Professional Development
- Professionalism and Scholarly Integrity
- Responsible Conduct of Research