Course Descriptions

Walter Daelemans (Antwerpen), advanced, 4 hours

Computational Stylometry

Some of the language variation in texts can be linked to psychological and sociological properties of their authors (personality, mental health, age, gender, education level, region, ...). Assuming that this link is robust and invariant, techniques can be developed to determine these psychological and sociological variables on the basis of text. This "computational stylometry" gives rise to interesting applications such as gender and age detection from text, prediction of the onset of Alzheimer's disease from writing, etc. Ultimately, this leads to the hypothesis of the existence of a "human stylome": a distribution of linguistic properties uniquely defining an individual author and making reliable authorship attribution possible. Applications range from disputed authorship in literary studies to forensic applications, marketing, etc.

In this course, we presuppose basic knowledge of computational linguistics and machine learning, and explain the dominant approach to computational stylometry, based on a machine learning approach to text categorization combined with linguistic analysis of documents. We provide an overview of achievements and issues in authorship attribution and other applications of computational stylometry. More specifically, we focus on a number of problematic cases: computational stylometry in applications with small amounts of training data, authorship attribution with many potential authors, computational stylometry on chat language, and finally the question of how author (group) characteristics can be distinguished from other factors leading to linguistic variation: topic, register, and genre.
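To give a flavor of the text-categorization view of authorship attribution mentioned above, here is a minimal stdlib-only sketch: each author is profiled by relative frequencies of a small, hand-picked function-word list, and an unknown text is attributed to the author with the most similar profile. The word list and cosine similarity are illustrative choices for this sketch, not the specific method taught in the course.

```python
from collections import Counter
import math

# Illustrative function-word list; real stylometry uses hundreds of features.
FUNCTION_WORDS = ["the", "of", "and", "to", "in", "it", "was",
                  "he", "she", "his", "her", "with", "for", "as"]

def profile(text):
    """Relative frequency of each function word in the text."""
    words = text.lower().split()
    counts = Counter(w for w in words if w in FUNCTION_WORDS)
    total = len(words) or 1
    return [counts[w] / total for w in FUNCTION_WORDS]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def attribute(unknown_text, candidates):
    """candidates: dict mapping author name -> training text."""
    p = profile(unknown_text)
    return max(candidates, key=lambda a: cosine(p, profile(candidates[a])))
```

Replacing the hand-built profiles with a trained text classifier over richer features (character n-grams, syntax) yields the standard machine-learning setup the course describes.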

Robert Dale (Macquarie), intermediate, 8 hours

Automated Writing Assistance: Grammar Checking and Beyond

We all write, and most of us think we could write better than we do. In this course, we look at how automated techniques developed in natural language processing might be put to use to help people write better. We cover existing fielded technologies for spelling, grammar and style checking, and look at the kinds of techniques that might be required for specific audiences, such as second-language learners, and for forms of writing assistance concerned with issues beyond the sentence level.
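Many fielded style checkers of the kind surveyed here are built from simple pattern rules. As a hypothetical illustration (not a technique from the course itself), the classic doubled-word check can be written in a few lines:

```python
import re

def repeated_words(text):
    # Flags immediate word duplications such as "the the",
    # a standard rule in style and grammar checkers.
    return [m.group(1)
            for m in re.finditer(r"\b(\w+)\s+\1\b", text, re.IGNORECASE)]
```

Rules like this are cheap and precise but sentence-internal; the audience-specific and discourse-level assistance discussed in the course requires considerably more linguistic machinery.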


Ralph Grishman (New York), intermediate, 8 hours

Information Extraction

Information extraction is the process of creating semantically structured information from unstructured text. We will present methods for identifying and classifying names and other textual references to entities; for capturing semantic relations; and for recognizing events and their arguments. We will consider hand-coded rules and various machine learning approaches, including fully supervised learning, semi-supervised learning, and distant supervision. (Basic machine learning concepts will be reviewed, but some prior acquaintance with machine learning methods or corpus-trained language models will be helpful.) Applications will be presented briefly in this course and in greater depth in the Text Mining course.
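As a toy illustration of the hand-coded-rule end of the spectrum, the sketch below recognizes two entity types with regular expressions: person titles signal PERSON mentions and corporate suffixes signal ORG mentions. The rule set is hypothetical and deliberately tiny; real rule-based extractors use large cascades of such patterns.

```python
import re

# Hypothetical minimal rules: a title introduces a person name,
# a corporate suffix closes an organization name.
PERSON = re.compile(r"\b(?:Dr|Mr|Ms|Prof)\.\s+[A-Z][a-z]+(?:\s+[A-Z][a-z]+)*")
ORG = re.compile(r"\b[A-Z][A-Za-z]*(?:\s+[A-Z][A-Za-z]*)*\s+(?:Inc|Corp|Ltd)\.")

def extract_entities(text):
    """Return (type, mention) pairs found by the rules."""
    entities = [("PERSON", m.group()) for m in PERSON.finditer(text)]
    entities += [("ORG", m.group()) for m in ORG.finditer(text)]
    return entities
```

The supervised, semi-supervised, and distantly supervised approaches covered in the course replace such patterns with classifiers trained on annotated (or automatically aligned) data.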


Daniel Jurafsky (Stanford), introductory/advanced, 8 hours

Computational Extraction of Social and Interactional Meaning

This course introduces methods for extracting social or affective meaning from text and speech. We will cover literature on most of the classes of affective meaning in Scherer's standard typology: sentiment, i.e. enduring, affectively colored preferences or attitudes (likes, dislikes), personality traits (extroverted, anxious, agreeable), brief evaluative episodes like emotions (anger, joy, shame), and interpersonal stances taken toward another person in a conversation (friendly, cold, flirtatious). We will analyze the linguistic cues from text and speech (lexical, dialog act, prosodic, spectral) that signal these kinds of meaning.
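Among the lexical cues for sentiment mentioned above, the simplest is a polarity lexicon. The sketch below is a minimal, hypothetical example of lexicon-based sentiment scoring; the word list is invented for illustration, whereas real systems draw on large curated resources.

```python
# Toy polarity lexicon (illustrative only).
LEXICON = {"love": 1, "great": 1, "joy": 1,
           "hate": -1, "awful": -1, "angry": -1}

def sentiment(text):
    """Sum of word polarities; positive score = positive sentiment."""
    return sum(LEXICON.get(w, 0) for w in text.lower().split())
```

Dialog-act, prosodic, and spectral cues, which the course also covers, require speech-side processing that a bag-of-words score like this cannot capture.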


Chin-Hui Lee (Georgia Tech), intermediate, 8 hours

A Short Course on Digital Speech Processing and Applications

Speech is the most natural means of communication among humans. It also plays a critical role in enhancing human-machine communication. In this course, we attempt to cover all fundamental aspects of digital speech processing, both theoretical and practical, starting with the acoustics of speech sounds, followed by speech analysis and parameter extraction, speech modeling, the theory of linear prediction, and hidden Markov models. Finally, speech applications, including speech coding, synthesis, recognition and verification, will be introduced. The linkage to acoustics and language processing will also be discussed, including topics in language modeling and microphone arrays. MATLAB demos will be used in class for illustration, and some homework exercises will be provided for after-class learning.
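Of the models named above, the hidden Markov model is the workhorse of speech recognition. As a self-contained sketch (in Python rather than the course's MATLAB), the forward algorithm below computes the likelihood of an observation sequence under an HMM; the dictionary-based parameterization is an illustrative choice:

```python
def forward(obs, states, start_p, trans_p, emit_p):
    """HMM forward algorithm: returns P(obs) under the model.

    alpha[t][s] = P(o_1..o_t, state_t = s).
    """
    alpha = [{s: start_p[s] * emit_p[s][obs[0]] for s in states}]
    for o in obs[1:]:
        prev = alpha[-1]
        alpha.append({s: sum(prev[r] * trans_p[r][s] for r in states)
                         * emit_p[s][o]
                      for s in states})
    return sum(alpha[-1].values())
```

In speech recognition the discrete observations used here are replaced by acoustic feature vectors (e.g. from the parameter-extraction stage covered earlier in the course) with continuous emission densities.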


Intended Audience: This short course is intended for researchers, engineers and professionals who are starting speech-related work and want to acquire basic knowledge of digital speech processing, as well as those who are already involved in speech technology development and would like to learn more of the fundamentals. The course is designed with broad coverage of all areas related to digital speech processing, with linkages to language and acoustics.


Yuji Matsumoto (Nara), introductory/intermediate, 8 hours

Syntax and Parsing: Phrase Structure and Dependency Parsing

This course introduces grammatical formalisms of natural language and parsing algorithms. Several phrase-structure-based grammar formalisms, such as Head-driven Phrase Structure Grammar, Lexical Functional Grammar, Categorial Grammar and Tree Adjoining Grammar, are briefly introduced, and phrase-structure-based and dependency-based syntactic representations are discussed comparatively. Then, parsing algorithms for phrase structure grammars and those for word dependency structures are introduced. This course also covers recent advances in statistical dependency parsing algorithms.
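As a concrete taste of phrase-structure parsing of the kind covered here, the sketch below is a CKY recognizer over a tiny, invented grammar in Chomsky normal form (the grammar and sentence are illustrative, not course material):

```python
def cky(words, binary, lexical):
    """CKY recognizer: returns the set of nonterminals spanning the sentence.

    binary:  dict mapping (B, C) -> A for rules A -> B C
    lexical: dict mapping word -> preterminal
    """
    n = len(words)
    table = [[set() for _ in range(n + 1)] for _ in range(n + 1)]
    for i, w in enumerate(words):
        table[i][i + 1].add(lexical[w])          # assumes all words are known
    for span in range(2, n + 1):
        for i in range(n - span + 1):
            j = i + span
            for k in range(i + 1, j):
                for b in table[i][k]:
                    for c in table[k][j]:
                        if (b, c) in binary:
                            table[i][j].add(binary[(b, c)])
    return table[0][n]

# Toy CNF grammar (hypothetical).
BINARY = {("NP", "VP"): "S", ("Det", "N"): "NP", ("V", "NP"): "VP"}
LEXICAL = {"the": "Det", "dog": "N", "cat": "N", "saw": "V"}
```

A dependency parser, by contrast, would output a head index per word rather than a tree over spans; the statistical dependency parsers discussed in the course score such head attachments with learned models.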


Diana Maynard (Sheffield), introductory/intermediate, 8 hours

Text Mining

This course gives an introduction to tools and techniques for text mining and analysis, building on many of the modules taught in the Information Extraction course to show how information objects identified from text can be used to sift through and make sense of large volumes of data. It includes material on semantic web technologies, opinion mining, semantic search, performance evaluation and multilingual issues. The techniques will be illustrated with a number of real-world applications, many of them based on the GATE architecture for language processing.


Please note that if you are planning to follow any of the Text Mining modules taught by Diana Maynard, there will be some time allocated for hands-on experimentation with GATE, to help your understanding of the material. It is therefore highly recommended to download GATE in advance of the course and to bring a laptop with GATE installed on it. You do NOT, however, need any previous experience with GATE.
GATE is freely available for download from the GATE website. It is important that you install version 6.1 or later (i.e. the latest stable release, or the latest nightly build). If you already have an earlier version installed, please uninstall it and install the latest version, in order to avoid conflicts. It is also important that you have Java 6 installed on your computer, as GATE requires it.