Course Description


Jont B. Allen (University of Illinois at Urbana-Champaign), [introductory, 8 hours]

A Parametric Analysis of Speech Perception

In this series of lectures, we shall study speech perception and the perceptual space of consonants. First we will address the auditory system, including the cochlea and the early auditory pathway. Next we will explore the auditory and phoneme feature spaces that define the plosive (p,t,k,b,d,g) and fricative (s,S,f,t,T,z,Z) consonants. For example, what distinguishes /t/ from /d/ and /p/, or /s/ from /S/? Knowing the features that define the consonants is critical to improving speech coding and speech recognition software, and to understanding hearing aid signal processing. The information presented in these lectures is based on research by the author and his students, as described in the references given below. Rarely addressed issues will include the dynamic range of the auditory system and of speech, and the phoneme error rate as a function of the signal-to-noise ratio. Software will be provided so that students may modify speech sounds themselves; this will require students to have their own PC running Matlab.
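
The signal-to-noise manipulations behind the phoneme error rate experiments can be sketched in a few lines. The snippet below mixes a stand-in "speech" signal with noise at a requested SNR; the signals and the `mix_at_snr` helper are illustrative assumptions only (the course software itself runs in Matlab).

```python
import numpy as np

def mix_at_snr(speech, noise, snr_db):
    """Scale `noise` so that speech + noise has the requested SNR in dB."""
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2)
    # Choose gain g so that 10*log10(p_speech / (g^2 * p_noise)) == snr_db
    gain = np.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10)))
    return speech + gain * noise

rng = np.random.default_rng(0)
speech = np.sin(2 * np.pi * 200 * np.arange(16000) / 16000)  # stand-in "speech"
noise = rng.standard_normal(16000)
noisy = mix_at_snr(speech, noise, snr_db=0.0)  # 0 dB SNR condition
```

Sweeping `snr_db` over a range of values and measuring listeners' phoneme errors at each level is the kind of parametric experiment the lectures analyze.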

Hours | discussion/topic

  1. Basics of the auditory system (installing Beren in Matlab)
  2. The articulation index and its derivatives (e.g. SII)
  3. Information theoretical measures of speech perception
  4. Background on speech features: Haskins to the present
  5. Plosive features (demos, hands-on practice with Beren)
  6. Fricative features, extended demos: using Beren in Matlab
  7. Manipulation of sentences with Kunlun (in Matlab)
  8. Speech perception and the hearing impaired

References:


Hervé Bourlard (Idiap Research Institute, Martigny), [introductory/intermediate, 6 hours]

Automatic Speech Recognition and Multilingual Speech Processing: HMM, Hybrid HMM/ANN and Posterior-based Systems

After reviewing the current state of the art in HMM-based automatic speech recognition (ASR), we will discuss hybrid systems combining HMMs and artificial neural networks (ANNs), as well as the current trend towards using phone and subword-unit posterior distributions (often also referred to as “categorical distributions”) in new types of HMMs, or directly as new HMM features.
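
The hybrid HMM/ANN idea can be illustrated with a minimal sketch: ANN phone posteriors p(q|x) are divided by phone priors p(q) to obtain scaled likelihoods usable in place of Gaussian-mixture likelihoods during HMM decoding. All numbers below are made up for illustration.

```python
import numpy as np

# Hypothetical per-frame phone posteriors p(q|x) from an ANN (2 frames x 3 phones)
posteriors = np.array([[0.7, 0.2, 0.1],
                       [0.1, 0.8, 0.1]])
# Hypothetical phone priors p(q), e.g. estimated from training alignments
priors = np.array([0.5, 0.3, 0.2])

# Hybrid HMM/ANN trick: scaled likelihoods p(x|q) proportional to p(q|x) / p(q)
scaled_likelihoods = posteriors / priors
log_scaled = np.log(scaled_likelihoods)  # decoders typically work in log space
```

The division by the prior removes the class-frequency bias the ANN learned from its training data, which is what lets the posterior-based scores stand in for the HMM's state-conditional likelihoods.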

In the second part of the course, we will focus on new trends in multilingual speech processing, including multilingual speech recognition, multilingual speech synthesis, and the convergence between the two. Indeed, over the last decade, ASR and TTS technologies have converged towards statistical parametric approaches. We believe that properly addressing complex multilingual ASR and TTS tasks (including low-resourced languages), with the goal of improving the robustness and quality of both speech recognition and speech synthesis systems, will require looking at these problems in an integrated way.

Among the most advanced topics, one objective of this course is thus to investigate multiple related facets of the multilingual ASR and TTS problems, focusing mainly on the key aspects of cross-language and speaker adaptation, and on those approaches that aim to reduce the gap between speech recognition and speech synthesis.

This course assumes a minimal background in statistical pattern processing and speech signal processing.

References:


Marcello Federico (Bruno Kessler Foundation, Trento), [introductory/intermediate, 6 hours]

Statistical Machine Translation

Statistical machine translation is nowadays among the most popular and active research fields in natural language processing. This crash course offers a general introduction to the problem and applications of machine translation, followed by five lectures focusing on core techniques and approaches of statistical machine translation. References to open-source software, language resources, and benchmarks will also be given, so that interested students can put into practice what they acquire during the course.
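
The core of the statistical approach can be sketched as noisy-channel scoring: choose the target sentence e maximizing log p(f|e) + log p(e), combining a translation model and a language model. The phrases and probabilities below are made up purely for illustration.

```python
import math

# Toy translation-model log-probabilities log p(f|e) (all values invented)
translation_logprob = {
    ("la casa", "the house"): math.log(0.6),
    ("la casa", "the home"): math.log(0.4),
}
# Toy language-model log-probabilities log p(e)
lm_logprob = {"the house": math.log(0.05), "the home": math.log(0.02)}

def best_translation(f, candidates):
    """Noisy-channel decision rule: argmax_e log p(f|e) + log p(e)."""
    return max(candidates,
               key=lambda e: translation_logprob[(f, e)] + lm_logprob[e])

best = best_translation("la casa", ["the house", "the home"])
```

Real systems score millions of phrase pairs and search over reorderings, but the decision rule is this same weighted combination of model scores.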

Course outline:

References:


Giuseppe Riccardi (University of Trento), [intermediate/advanced, 6 hours]

Spoken Language Understanding

Spoken language understanding (SLU) investigates human-machine and human-human communication by leveraging technologies from signal processing, pattern recognition, machine learning, and artificial intelligence. SLU systems are designed to extract meaning from speech utterances, and their applications are vast, from conversational agents (or companions) to meeting summarization and speech and language analytics. In these lectures, we will define the problem of speech understanding, review current grammar-based and data-driven models, and examine the types of semantic structures used in the latest advanced SLU systems. In the last part, we will review current research challenges and SLU system case studies.

References:


Noah A. Smith (Carnegie Mellon University), [intermediate, 9 hours]

Probability and Structure in Natural Language Processing

This course covers key ideas at the junction of natural language processing (NLP) and machine learning. The goal is to make it easier for NLP researchers to follow relevant research in machine learning, and to contribute to the growing body of research that uses advanced statistical modeling techniques to solve hard language processing problems. The tutorial breaks down into three main parts.

Probabilistic Graphical Models. Probabilistic graphical models are a major topic in machine learning. They provide a foundation for statistical modeling of complex data, and starting points (if not full-blown solutions) for inference and learning algorithms. They generalize many familiar methods in NLP. We'll cover Bayesian networks, Markov networks, and the relationship between them, and present inference as the central question when working with graphical models.
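
The factorization-plus-inference pattern can be shown on the smallest possible Bayesian network, a two-node chain A → B with made-up conditional probability tables: the joint factorizes as p(a, b) = p(a) p(b|a), and inference (here, marginalizing out A) sums over the unobserved variable.

```python
# Invented tables for a two-node Bayesian network A -> B
p_a = {0: 0.6, 1: 0.4}
p_b_given_a = {0: {0: 0.9, 1: 0.1},
               1: {0: 0.3, 1: 0.7}}

def joint(a, b):
    # Factorized joint distribution: p(a, b) = p(a) * p(b | a)
    return p_a[a] * p_b_given_a[a][b]

# Inference by marginalization: p(B = b) = sum_a p(a) p(b | a)
p_b = {b: sum(joint(a, b) for a in p_a) for b in (0, 1)}
```

On larger networks this naive summation becomes exponential, which is exactly why efficient inference is the central question the lectures address.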

Linear Structure Models. Most problems in linguistic analysis are currently solved by applying discrete optimization techniques (dynamic programming, search, and others) to identify a structure that maximizes some score given an input. We describe a few ways to think about the problem of prediction itself (a kind of inference), and review key approaches to learning structured prediction models. An emphasis will be placed on unifying a wide range of approaches (generative models, conditional models, structured perceptron, structured max margin).
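
The "maximize a decomposable score by dynamic programming" idea is well illustrated by Viterbi decoding for sequence labeling. The emission and transition scores below are arbitrary illustrative numbers, not from any real model.

```python
import numpy as np

# score(position t, label y) and score(label y_prev -> label y), invented
emit = np.array([[2.0, 0.5],
                 [0.3, 1.5],
                 [1.0, 1.2]])
trans = np.array([[0.5, 0.1],
                  [0.2, 0.8]])

T, K = emit.shape
delta = np.zeros((T, K))          # best score of any prefix ending in label y
back = np.zeros((T, K), dtype=int)  # backpointers for recovering the argmax
delta[0] = emit[0]
for t in range(1, T):
    for y in range(K):
        scores = delta[t - 1] + trans[:, y] + emit[t, y]
        back[t, y] = int(np.argmax(scores))
        delta[t, y] = scores[back[t, y]]

# Recover the highest-scoring label sequence by following backpointers
path = [int(np.argmax(delta[-1]))]
for t in range(T - 1, 0, -1):
    path.append(int(back[t, path[-1]]))
path.reverse()
```

The same recurrence serves generative HMMs, CRFs, the structured perceptron, and max-margin models alike; only where the scores come from (and how they are learned) differs, which is the unification the lectures emphasize.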

Incomplete Data. Since we will never have as much annotated linguistic data as we'd like in all the languages, domains, and genres for which we'd like to do NLP, semi-supervised and unsupervised learning have become hugely important. We show how the foundations from the first two parts can be extended to provide a framework for learning with incomplete data. We'll review Expectation-Maximization in light of what we have covered so far and discuss recently proposed Bayesian techniques.
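
Expectation-Maximization can be sketched on a toy incomplete-data problem: a two-component Gaussian mixture with unit variance, where the component assignments are the unobserved data. The data, initialization, and iteration count are all illustrative choices.

```python
import numpy as np

# Synthetic data from two clusters (the component labels are "missing")
rng = np.random.default_rng(1)
data = np.concatenate([rng.normal(-2, 1, 200), rng.normal(3, 1, 200)])

mu = np.array([-1.0, 1.0])   # arbitrary initial means
pi = np.array([0.5, 0.5])    # mixing weights
for _ in range(50):
    # E-step: posterior responsibility of each component for each point
    log_lik = -0.5 * (data[:, None] - mu) ** 2 + np.log(pi)
    log_lik -= log_lik.max(axis=1, keepdims=True)   # for numerical stability
    resp = np.exp(log_lik)
    resp /= resp.sum(axis=1, keepdims=True)
    # M-step: re-estimate parameters from the expected sufficient statistics
    mu = (resp * data[:, None]).sum(axis=0) / resp.sum(axis=0)
    pi = resp.mean(axis=0)
```

Replace the Gaussians with multinomials over words or production rules and the same E/M alternation yields the unsupervised estimators used throughout NLP.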

References:


Bayya Yegnanarayana (International Institute of Information Technology Hyderabad), [introductory/intermediate, 8 hours]

Speech Signal Processing

The objective of this course is to introduce the basics of processing speech signals to extract features of speech production for various applications such as speech recognition, speaker recognition, and speech enhancement. No prior background in speech or signal processing is required. Knowledge of basic mathematics at the degree level is assumed. This self-contained course consists of four parts:

  1. Speech production mechanism and nature of speech signals. Discusses how the characteristics of speech production are reflected in the speech signal.
  2. Basics of digital signal processing. Introduces basic concepts of equivalent representations of signals and systems, and some tools for processing speech signals.
  3. Speech signal processing methods. Introduces basics of short-time spectrum analysis and linear prediction analysis, the two most important speech analysis methods.
  4. Epoch-based analysis of speech. Introduces new evolving methods of speech analysis for extracting the time-varying characteristics of the source and system using the knowledge of speech production.
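
Linear prediction analysis, one of the two core methods in part 3, can be sketched by solving the autocorrelation normal equations for a toy autoregressive signal. The signal, order, and coefficient values below are illustrative assumptions, not material from the course.

```python
import numpy as np

def lpc(signal, order):
    """Estimate linear prediction coefficients (autocorrelation method)."""
    # Autocorrelation values r[0..order]
    r = np.array([signal[: len(signal) - k] @ signal[k:]
                  for k in range(order + 1)])
    # Solve the Toeplitz normal equations R a = r[1:]
    R = np.array([[r[abs(i - j)] for j in range(order)]
                  for i in range(order)])
    return np.linalg.solve(R, r[1:])

# A synthetic AR(2) signal: s[n] = 1.3 s[n-1] - 0.6 s[n-2] + e[n]
rng = np.random.default_rng(2)
s = np.zeros(4000)
e = rng.standard_normal(4000)
for n in range(2, 4000):
    s[n] = 1.3 * s[n - 1] - 0.6 * s[n - 2] + e[n]

a = lpc(s, order=2)   # should recover roughly [1.3, -0.6]
```

Applied frame by frame to real speech, the recovered coefficients model the vocal-tract filter, and the prediction residual carries the excitation source, the separation exploited in parts 3 and 4.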

References:

  • L.R. Rabiner and R.W. Schafer, Theory and Applications of Digital Speech Processing. Pearson Education, Inc., 2011
  • L.R. Rabiner and B.-H. Juang, Fundamentals of Speech Recognition. Pearson Education, Inc., 1993
  • B. Yegnanarayana and Suryakanth V. Gangashetty, Epoch-based analysis of speech signals, special issue on Speech Communication and Signal Processing, Sadhana: Academy Proceedings in Engineering Sciences, Indian Academy of Sciences, pp. 651-697, October 2011