The curriculum is divided into seven modules: Module 1 is a basic introduction to text analysis and broad text analysis workflows, and the subsequent modules roughly cover different steps or aspects of a text analysis research workflow: gathering textual data (Modules 2.1 and 2.2), cleaning and preparing textual data (Module 3), analyzing textual data (Modules 4.1 and 4.2), and visualizing textual data (Module 5).

For supplementary materials, including set-up and activity files, and instructions on how to use these modules, refer back to the Teaching Materials.

Module 1 Getting Started: Text Analysis with the HathiTrust Research Center 

This module is a basic introduction to text analysis and the research methods and workflows it encompasses. It also introduces the HathiTrust Research Center (HTRC) and the tools and services the Center provides to facilitate large-scale text analysis of the HathiTrust Digital Library (HTDL). The module will train participants to recognize research questions that text analysis can address in order to better support text analysis research, understand broad text analysis workflows in order to make sense of digital scholarly research practices, and relate the HTRC to text analysis research as an example of how text analysis tool providers can support digital scholarship.

Module 2.1 Gathering Textual Data: Finding Text

This module introduces the options available to researchers for accessing textual data. In addition to discussing the variety of textual data providers, the lesson covers the process of building a text corpus in the HTDL interface and importing it into HTRC Analytics for analysis. Upon completion of this module, participants will be able to differentiate the various ways textual data can be gathered in order to make recommendations to researchers, evaluate textual data providers based on research needs in order to provide reference to researchers, and curate and select volumes to construct their own HTRC workset in order to gain experience building corpora.

Module 2.2 Gathering Textual Data: Bulk Retrieval

This module covers common methods for gathering textual data in an automated way that allows researchers to download text in bulk, including web scraping, APIs, and file transfers. It also introduces the command line interface and the command line tool wget for transferring files from a server. The module will help participants understand why automated access is valuable for building textual datasets to facilitate researcher needs around digital scholarship, and train them to practice using APIs and execute basic commands from the command line interface to gain confidence with computationally-intensive research.
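The API-based retrieval the module describes can be sketched in a few lines of Python. The base URL and parameter names below are purely illustrative, not a real HTRC or provider endpoint; the point is that bulk retrieval is simply the same request repeated over a list of volume identifiers.

```python
from urllib.parse import urlencode

# Hypothetical example: build request URLs for a text-provider API.
# The base URL and parameter names are illustrative placeholders --
# consult the actual provider's API documentation for real values.
BASE_URL = "https://api.example.org/volumes"

def build_request_url(volume_id, fmt="plaintext"):
    """Return a query URL asking the API for one volume in a given format."""
    query = urlencode({"id": volume_id, "format": fmt})
    return f"{BASE_URL}?{query}"

# Bulk retrieval is the same request repeated over many volume IDs.
volume_ids = ["mdp.39015012345678", "uc1.b000987654"]
for vid in volume_ids:
    print(build_request_url(vid))
```

Each printed URL could then be fetched from the command line (for example, with `wget "<url>"`), which is the same pattern the module practices with the wget tool.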

Module 3 Working with Textual Data

In order to do text analysis, a researcher needs some proficiency in wrangling and cleaning textual data. This module addresses the skills needed to prepare text for analysis after it has been acquired. At the end of this module, participants will be able to distinguish cleaning and preparing as one step in the text analysis workflow, recognize key strategies for preparing data to make recommendations to researchers, and run Python scripts from the command line to gain experience with the utility of Python for working with data.
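The preparation steps the module covers can be illustrated with a minimal Python sketch: lowercasing, stripping punctuation, tokenizing, and removing stopwords. The stopword list here is a tiny invented sample, not a standard list.

```python
import re

# A minimal sketch of common text-preparation steps: lowercase the text,
# keep only alphabetic tokens, and drop stopwords. The stopword set is a
# small illustrative sample, not a complete list.
STOPWORDS = {"the", "a", "an", "of", "and", "to", "in", "is"}

def prepare(text):
    """Return a list of cleaned, lowercased tokens with stopwords removed."""
    text = text.lower()
    tokens = re.findall(r"[a-z]+", text)  # tokenize on alphabetic runs
    return [t for t in tokens if t not in STOPWORDS]

print(prepare("The quick, brown Fox -- jumps over the lazy Dog!"))
# ['quick', 'brown', 'fox', 'jumps', 'over', 'lazy', 'dog']
```

A script like this would typically be saved to a file and run from the command line (e.g., `python prepare.py`), which is the workflow this module practices.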

Module 4.1 Analyzing Textual Data: Using Off-the-Shelf Tools

This module introduces how textual data can be analyzed using off-the-shelf, pre-built tools. It will discuss the advantages and constraints of web-based text analysis tools and programming solutions, introduce the basic text analysis algorithms available in HTRC Analytics, and demonstrate how to select, run, and view the results of the topic modeling algorithm. The module will train participants to assess web-based text analysis tools and programming solutions in relation to researcher questions and requests, and to match appropriate tools to research problems and distinguish different approaches to text analysis so they can suggest options for researchers. It will also guide participants in using web-based tools to gain experience with off-the-shelf text mining solutions, and in evaluating the results of running a text analysis algorithm to build confidence with the outcomes of data-intensive research.

Module 4.2 Performing Text Analysis: Basic Approaches with Python 

More advanced researchers will prefer to conduct text analysis outside of pre-built, off-the-shelf tools, opting instead for a toolkit of command line programs and custom code. This module will review text analysis strategies for the more advanced researcher and explore text analysis methods in more depth. It introduces the concept of programming packages and provides hands-on experience with running Python code to analyze Extracted Features files from the HTRC Extracted Features dataset. Upon completion of the module, participants will be able to identify the needs of advanced text mining researchers in order to make skill-appropriate recommendations, recognize text analysis methods to understand the kinds of research available in the field, and successfully interact with a pre-defined textual dataset in order to practice programming skills for data-driven research.
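The Extracted Features dataset stores per-page token counts (as JSON, usually read with a helper library rather than by hand). The sketch below imitates that page-level shape with plain Python dicts purely to show the aggregation idea; it is not the real file format or the HTRC library's API.

```python
from collections import Counter

# Illustrative sketch only: imitate page-level token counts, the core
# structure of an Extracted Features file, using plain dicts. The real
# dataset is JSON with much richer metadata.
pages = [
    {"liberty": 3, "union": 1},
    {"liberty": 2, "war": 4},
    {"union": 5, "war": 1},
]

def volume_counts(pages):
    """Sum page-level token counts into a single volume-level Counter."""
    total = Counter()
    for page in pages:
        total.update(page)
    return total

counts = volume_counts(pages)
print(counts.most_common(3))  # most frequent tokens across the volume
```

Aggregations like this (volume-level counts from page-level counts) are the building blocks of the analyses practiced in this module.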

Module 5 Visualizing Textual Data: An Introduction 

This module introduces the basics of data visualization, with a focus on visualizing textual data. It will also demonstrate how to use the HathiTrust+Bookworm interface to visualize word usage over time. At the end of this module, participants will be able to recognize common types of data visualizations to communicate with researchers about their options, and develop experience in reading data visualizations by exploring results in HathiTrust+Bookworm and making connections using available data and data points.
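The kind of computation behind a word-usage-over-time chart like HathiTrust+Bookworm's can be sketched in a few lines: for each year, divide a word's count by the total number of tokens in that year's sample. The corpus below is invented illustrative data, not HathiTrust data.

```python
# Toy sketch of relative word frequency per year, the quantity that a
# usage-over-time visualization plots. The corpus is invented data.
corpus_by_year = {
    1900: "the war began and the war raged".split(),
    1950: "peace came after the war".split(),
    2000: "peace and prosperity".split(),
}

def relative_frequency(word, tokens):
    """Fraction of a year's tokens that match the word."""
    return tokens.count(word) / len(tokens) if tokens else 0.0

for year, tokens in sorted(corpus_by_year.items()):
    print(year, round(relative_frequency("war", tokens), 3))
```

Plotting year against these fractions yields the familiar usage-over-time line that the module explores interactively in HathiTrust+Bookworm.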

Follow-on Instructional Content

This curriculum was updated in fall 2019 as part of a project extension. The curriculum was generalized and streamlined, and new case studies and activities were added. Access the updated curriculum on the Follow-on Instructional Content page.

Creative Commons License
This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.