Automated Content Analysis: special training session

15th January 2018 10:30 to 16:00

About the workshop

Qualitative data, such as essays and free-text responses in surveys, are rich sources of psychological, social and behavioural information. Yet such information has traditionally been impossible to leverage at a large scale. Recent advances in computational linguistics and machine learning have produced automatic content analysis tools, which can now be applied in a wide range of settings, including the open responses collected longitudinally within a large national birth cohort study.

In a new project funded by the Economic and Social Research Council, we are applying such tools to newly transcribed essays written in 1969 by cohort members of the National Child Development Study (NCDS), when they were aged 11 (“Imagine you are now 25 years old…”). The responses provide a largely untapped source of psychological and behavioural information that can be linked longitudinally to outcomes for the same individuals.

A new dataset containing the fully transcribed text for 10,500 of these essays will be released by the UK Data Service in mid-February 2018, and will be available for researchers worldwide to download and analyse.

To enable researchers to make the most of this new data release, the Centre for Longitudinal Studies is offering the exciting opportunity to attend a specialised tutorial on automated content analysis, provided by H. Andrew Schwartz, a faculty member of the Computer Science Department and Center for Computational Social Science at Stony Brook University, New York.

The Differential Language Analysis ToolKit

DLATK (Differential Language Analysis ToolKit) is an end-to-end language analysis package, specifically suited to social media and social scientific research applications. It has been used for research published in over 40 peer-reviewed papers across psychology, computer science, public health, medicine, and political science. Although the heart of DLATK is a Python library, it is typically used through a versatile command interface (requiring no programming).

This tutorial will cover the fundamentals of automated content analysis using DLATK:

  1. The ingredients of automatic content analysis
  2. Differential language analysis
    (Linguistic insights into psychosocial phenomena)
  3. Predictive analytics
    (Machine and statistical learning using text data)

Please note that registration for this event will close on 8 January 2018. To book your place via Eventbrite, please select the 'book now' button at the top of this page.

About the instructor

H. Andrew Schwartz is part of the faculty of the Computer Science Department and Center for Computational Social Science at Stony Brook University, New York. He was previously Lead Research Scientist for the interdisciplinary “World Well-Being Project” at the University of Pennsylvania where he created the Differential Language Analysis ToolKit. 


To participate in the tutorial you will need a laptop that can connect to the Internet (a Windows PC, Mac, or Linux PC are all fine). The training venue provides a free Wi-Fi connection for the day, and power to keep your laptop charged.

During the tutorial, participants will connect to a computer server which already has the analysis software (DLATK) installed. After you register, you will receive instructions on how to access this server and run a basic test command. This test needs to be completed at least 24 hours before the tutorial.

Desirable but non-essential expertise:

  • an entry-level understanding (or higher) of quantitative research methods
  • basic scripting (R, Python, or syntax/code in SPSS/SAS/STATA)

Recommended pre-reading

Differential Language Analysis ToolKit:


Schwartz, H. A., Giorgi, S., Sap, M., Crutchley, P., Ungar, L., & Eichstaedt, J. (2017). DLATK: Differential Language Analysis ToolKit. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing: System Demonstrations (pp. 55-60).

Kern, M. L., Park, G., Eichstaedt, J. C., Schwartz, H. A., Sap, M., Smith, L. K., & Ungar, L. H. (2016). Gaining insights from social media language: Methodologies and challenges. Psychological Methods, 21(4), 507-525.

Grimmer, J., & Stewart, B. M. (2013). Text as data: The promise and pitfalls of automatic content analysis methods for political texts. Political Analysis, 21(3), 267-297.

Schwartz, H. A., & Ungar, L. H. (2015). Data-driven content analysis of social media: a systematic overview of automated methods. The ANNALS of the American Academy of Political and Social Science, 659(1), 78-94.


If you have any queries or require further information, please contact Geeting Wong (