2014 Tarragona International Summer School on
Trends in Computing

Tarragona, Spain, July 7-11, 2014

Course Description

Keynote Speakers

Larry S. Davis (U Maryland, College Park), Object Recognition and Image Labeling


Summary. This talk will start with a historical perspective on the fundamental problem of how everyday objects can be represented and recognized in images. Should objects be represented based on their intrinsic characteristics, or based on how they “appear” to a sensor system? Given an image of a complex scene containing many objects, how do we capture and apply everyday common sense knowledge to identify all of the objects in such an image? This problem, which I will refer to as the image labeling problem, involves challenging combinatorial problems associated with constructing globally “optimal” descriptions of images in terms of potentially very large collections of object types. The constraints (knowledge) that are utilized in these optimization procedures are loosely referred to as “context.” So, for example, vehicles are generally supported by the ground, so that an estimate of ground plane location parameters in an image constrains the positions and apparent sizes of vehicles.

Another source of context is the set of everyday spatial and temporal relationships between objects; so, for example, keyboards are typically “on” tables and not “on” cats. The next part of the talk will discuss how visually grounded models of object appearance and relations between objects can be simultaneously learned from weakly labeled images (images which are linguistically but not spatially annotated – i.e., we are told there is a car in the image, but not where the car is located). Once these models are acquired, one approach to inferring what objects appear in a new image is to segment the image into pieces, construct a graph based on the regions in the segmentation and the relationships modeled, and then apply probabilistic inference to the graph. Recent research in computer vision relaxes the (unreasonable) assumption that one can segment an image into regions that correspond to objects. I will describe an approach that can simultaneously construct instances of objects out of collections of connected segments that look like objects, while also softly enforcing contextual constraints.
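As a toy illustration of this kind of contextual inference (all region names and scores below are hypothetical, not taken from the talk), one can exhaustively score joint labelings of two regions by combining per-region appearance scores with a pairwise “on” compatibility term, so that context can override weak appearance evidence:

```python
import itertools

# Hypothetical appearance scores for two image regions,
# where region r2 rests "on" region r1.
appearance = {"r1": {"table": 0.7, "cat": 0.3},
              "r2": {"keyboard": 0.6, "book": 0.4}}

# Hypothetical context: how plausible it is that label_a is "on" label_b.
on_compat = {("keyboard", "table"): 0.9, ("keyboard", "cat"): 0.1,
             ("book", "table"): 0.8, ("book", "cat"): 0.2}

def best_labeling(appearance, on_compat):
    """Score every joint labeling (appearance terms plus the pairwise
    'on' context term) and return the highest-scoring one."""
    best, best_score = None, float("-inf")
    for l2, l1 in itertools.product(appearance["r2"], appearance["r1"]):
        score = (appearance["r2"][l2] + appearance["r1"][l1]
                 + on_compat[(l2, l1)])
        if score > best_score:
            best, best_score = (l2, l1), score
    return best
```

Real systems replace this brute-force search with probabilistic inference over a graph of many regions, but the scoring structure is the same.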

Finally, I will end the talk by discussing some emerging research themes in computer vision that should contribute to further progress in recognition and labeling. These include attribute models for objects, methods for proposing object-like segments for labeling, and deep learning.


  • Abhinav Gupta and Larry S. Davis, "Beyond Nouns: Exploiting Prepositions and Comparative Adjectives for Learning Visual Classifiers," in ECCV 2008
  • W. R. Schwartz, A. Kembhavi, D. Harwood, and L. S. Davis, "Human Detection Using Partial Least Squares Analysis," in ICCV 2009, pp. 24-31
  • Xi Chen, Arpit Jain, Abhinav Gupta, and Larry S. Davis, "Piecing Together the Segmentation Jigsaw Using Context," in CVPR 2011

Bio. Prof. Larry S. Davis is a Professor in the Institute for Advanced Computer Studies and the Department of Computer Science at the University of Maryland, College Park, MD. He received his B.A. from Colgate University and both his M.S. and Ph.D. from the University of Maryland. He was the founding Director of the Institute for Advanced Computer Studies and served as chair of the Department of Computer Science from 1998 to 2012. Prof. Davis is well known for his contributions to computer vision, especially to video surveillance and video data analysis. He has served as both program chair and general chair for many major conferences, including Computer Vision and Pattern Recognition and the International Conference on Computer Vision. He has served on DARPA’s ISAT advisory panel. Prof. Davis has published over 250 papers in leading conferences and journals on computer vision and has advised more than 40 Ph.D. students. He is a fellow of IAPR, ACM, and the IEEE.

David S. Johnson (Columbia U, New York), Open and Closed Problems in NP-Completeness


Summary. The Theory of NP-Completeness today provides links between the many areas of computer science, together with important questions in mathematics, operations research, economics, and even the physical sciences. A resolution of its central question, whether P equals NP, will now win you a $1,000,000 Millennium Prize. A "yes" answer might yield a wide-ranging technological and scientific revolution, while a "no" answer will at least allow the Internet commerce industry to feel a bit more secure.

In this talk I say a little about the history (and pre-history) of the theory, which was initiated in 1971 by Stephen Cook and Leonid Levin, independently, and then broadly illustrated by Richard Karp in 1972. I survey some of the major NP-completeness and polynomial-time solvability results since then, as well as the many failed attempts at proving (or disproving) P = NP. I conclude with an exploration of the ways in which the theory has expanded from feasibility and optimization questions to ones about approximation, greatly assisted by the alternative characterization of NP in terms of "probabilistically checkable proofs" in the early 1990s, and list some of the key open questions remaining in this domain.

Bio. David S. Johnson is a Visiting Professor of Computer Science at Columbia University. For the previous 40 years he was a researcher at AT&T (in its many incarnations), serving as Head of its Algorithms and Optimization Research Department from 1988 to 2013. He received a PhD in Mathematics from MIT in 1973. He is perhaps best known as the co-author of Computers and Intractability: A Guide to the Theory of NP-Completeness, for which he and his co-author M. R. Garey won the INFORMS Lanchester Prize. His research interests include approximation algorithms for combinatorial problems, a subject on which he had several of the first key papers, starting with his PhD thesis on the "bin packing" problem, and continuing with extensive work on that and the Traveling Salesman Problem (TSP). In recent years, he has become a leader in the experimental analysis of algorithms, and has been the creator and lead organizer for the series of DIMACS Implementation Challenges, covering topics from Network Flows to the TSP. He also founded the annual ACM-SIAM Symposium on Discrete Algorithms (SODA) and served as its Steering Committee Chair from 1990 to 2013. He has over 100 scientific publications and, in addition, has written 26 editions of an ongoing column on the theory of NP-completeness, currently appearing in the ACM Transactions on Algorithms. He is an ACM Fellow, a SIAM Fellow, an AT&T Fellow, and winner of the 2010 Knuth Prize for outstanding contributions to the foundations of computer science.

George Karypis (U Minnesota, Twin Cities), Top-N Recommender Systems: Revisiting Item Neighborhood Methods


Summary. Top-N recommender systems are designed to generate a ranked list of items that a user will find useful based on the user’s prior activity. These systems have become ubiquitous and are an essential tool for information filtering and (e-)commerce. Over the years, collaborative filtering, which derives these recommendations by leveraging the past activities of groups of users, has emerged as the most prominent approach for solving this problem. Among the multitude of methods that have been developed, item-based nearest neighbor algorithms are among the simplest and yet best-performing methods for Top-N recommender systems. These methods rank the items to be recommended based on how similar they are to the items in a user’s prior activity history, using various co-occurrence similarity measures. In this talk we present our recent work on item-based neighborhood methods that has substantially improved the accuracy of the predictions.

One shortcoming of traditional item-based neighborhood methods is that they rely on a similarity measure that needs to be specified a priori. To address this problem we developed a class of item-based neighborhood methods that directly estimate a sparse item-item similarity matrix from the training data. This similarity matrix is estimated using a structural equation modeling (SEM) framework, which requires each column of the user-item matrix to be approximated as a sparse aggregation of some other columns. These other columns correspond to the learned neighbors, and their aggregation weights to the learned similarities.

A second shortcoming of item-based neighborhood methods is that their item-item similarity measures rely on co-occurrences, which become problematic when the datasets are very sparse and the number of item pairs with sufficiently many co-occurrences is small. To address this problem we extended the SEM framework to estimate a factored version of the item-item similarity matrix. This factored representation projects the items into a lower-dimensional space, which allows for meaningful similarity estimates between items that never co-occurred in the original user-item matrix. In addition, we discuss and present results from our work enhancing these SEM models by incorporating item side information and higher-order item relations to further improve Top-N recommendation accuracy, and from methods that address the item cold-start recommendation problem.
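A minimal sketch of the column-wise estimation idea on a tiny dense matrix (for illustration it substitutes plain least squares with post-hoc thresholding for the regularized SEM formulation, which would add l1/l2 penalties and non-negativity; all names here are illustrative):

```python
import numpy as np

def estimate_similarity(X, threshold=0.05):
    """Approximate each item column of the user-item matrix X as an
    aggregation of the other columns; the learned weights play the
    role of item-item similarities, with a zero-diagonal constraint."""
    n_users, n_items = X.shape
    W = np.zeros((n_items, n_items))
    for j in range(n_items):
        others = [k for k in range(n_items) if k != j]  # exclude item j itself
        w, *_ = np.linalg.lstsq(X[:, others], X[:, j], rcond=None)
        w[np.abs(w) < threshold] = 0.0  # crude sparsification
        W[others, j] = w
    return W

def recommend(X, W, user, n=3):
    """Score unseen items for `user` and return the top-n item indices."""
    scores = X[user] @ W
    scores[X[user] > 0] = -np.inf  # exclude already-seen items
    return np.argsort(scores)[::-1][:n]
```

The zero diagonal prevents the trivial solution in which each item "explains" itself, which is the key structural constraint the SEM formulation enforces.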

Bio. George Karypis is a professor in the Department of Computer Science & Engineering at the University of Minnesota, Twin Cities. His research interests span the areas of data mining, high performance computing, information retrieval, collaborative filtering, bioinformatics, cheminformatics, and scientific computing. His research has resulted in the development of software libraries for serial and parallel graph partitioning (METIS and ParMETIS), hypergraph partitioning (hMETIS), parallel Cholesky factorization (PSPASES), collaborative filtering-based recommendation algorithms (SUGGEST), clustering high dimensional datasets (CLUTO), finding frequent patterns in diverse datasets (PAFI), and protein secondary structure prediction (YASSPP). He has coauthored over 230 papers on these topics and two books (“Introduction to Protein Structure Prediction: Methods and Algorithms” (Wiley, 2010) and “Introduction to Parallel Computing” (Addison-Wesley, 2nd edition, 2003)). In addition, he serves on the program committees of many conferences and workshops on these topics, and on the editorial boards of the IEEE Transactions on Knowledge and Data Engineering, ACM Transactions on Knowledge Discovery from Data, Data Mining and Knowledge Discovery, the Social Network Analysis and Data Mining Journal, the International Journal of Data Mining and Bioinformatics, the journal Current Proteomics, Advances in Bioinformatics, and Biomedicine and Biotechnology.

Steffen Staab (U Koblenz), Explicit and Implicit Semantics: Two Sides of One Coin


Summary. The challenge of understanding data was first tackled by philosophers and linguists who asked what it means when people say something. Traditionally, from Aristotle to the early Wittgenstein (“Wittgenstein I”), a logic-based representation perspective has been pursued that models categories and uses such categories to determine the meaning of words and sentences. In the 20th century, empiricists in philosophy and linguistics took a new point of view that understood the semantics of language as a social construct and tried to elicit semantics from the observation of language and communication context.


  • N. Guarino, D. Oberle, S. Staab, What is an ontology? In: S. Staab & R. Studer. Handbook on Ontologies. Springer, 2009
  • P. Cimiano, A. Mädche, S. Staab, J. Völker. Ontology learning. In: S. Staab & R. Studer. Handbook on Ontologies. Springer, 2009

Philip Wadler (U Edinburgh), You and Your Research and The Elements of Style


Summary. This talk surveys advice from experts, including Richard Hamming, William Strunk, E. B. White, Donald Knuth, and others, on how to conduct your research and communicate your results.

Bio. Philip Wadler likes to introduce theory into practice, and practice into theory. An example of theory into practice: GJ, the basis for Java with generics, derives from quantifiers in second-order logic. An example of practice into theory: Featherweight Java specifies the core of Java in less than one page of rules. He is a principal designer of the Haskell programming language, contributing to its two main innovations, type classes and monads.

Wadler is Professor of Theoretical Computer Science at the University of Edinburgh. He is an ACM Fellow and a Fellow of the Royal Society of Edinburgh, served as Chair of ACM SIGPLAN, and is a past holder of a Royal Society-Wolfson Research Merit Fellowship. Previously, he worked or studied at Stanford, Xerox PARC, CMU, Oxford, Chalmers, Glasgow, Bell Labs, and Avaya Labs, and visited as a guest professor in Copenhagen, Sydney, and Paris. He has an h-index of 59, with more than 17,500 citations to his work according to Google Scholar. He is a winner of the POPL Most Influential Paper Award, has contributed to the designs of Haskell, Java, and XQuery, and is a co-author of Introduction to Functional Programming (Prentice Hall, 1988), XQuery from the Experts (Addison Wesley, 2004), and Generics and Collections in Java (O'Reilly, 2006). He has delivered invited talks in locations ranging from Aizu to Zurich.

Ronald R. Yager (Iona C, New Rochelle), Social Modeling


Summary. Computer-mediated social networks are now an important technology for worldwide communication, interconnection, and information sharing. Our goal here is to enrich social network modeling by introducing ideas from fuzzy sets. We approach this extension in two ways. One is with the introduction of fuzzy graphs representing the networks. This allows a generalization of the types of connections between nodes in a network from simply connected or not to weighted or fuzzy connections. A second and perhaps more interesting extension is the use of the fuzzy set based paradigm of computing with words to provide a bridge between a human network analyst's linguistic description of social network concepts and the formal model of the network. We will also describe some methods for sharing the large volumes of data obtained in these types of networks, using linguistic summarization and tagging methods. Time permitting, we shall look at some issues related to group decision making and diversity.
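A minimal sketch of the fuzzy-graph idea (the node names and membership degrees below are made up for illustration): edges carry membership degrees in [0, 1] rather than all-or-nothing links, and the strength of a path is the weakest link along it:

```python
# Hypothetical fuzzy social network: edge values are membership
# degrees in [0, 1], generalizing connected/not-connected links.
edges = {("ana", "bo"): 0.9, ("bo", "cy"): 0.4, ("ana", "cy"): 0.2}

def membership(a, b, edges):
    """Degree to which a and b are directly connected (0.0 if not)."""
    return edges.get((a, b), edges.get((b, a), 0.0))

def path_strength(path, edges):
    """Strength of a path in a fuzzy graph: the minimum membership
    degree along its edges (weakest-link semantics)."""
    degrees = [membership(a, b, edges) for a, b in zip(path, path[1:])]
    return min(degrees) if degrees else 0.0
```

In this toy network the indirect route ana -> bo -> cy (strength 0.4) is stronger than the direct ana-cy tie (0.2), the kind of graded judgment that crisp graphs cannot express.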

Bio. Ronald R. Yager is Director of the Machine Intelligence Institute and Professor of Information Systems at Iona College. He is editor-in-chief of the International Journal of Intelligent Systems. He has published over 500 papers and edited over 30 books in areas related to fuzzy sets, human behavioral modeling, decision-making under uncertainty, and the fusion of information. He is among the world’s top 1% most highly cited researchers, with over 38,000 citations in Google Scholar. He was the recipient of the IEEE Computational Intelligence Society Pioneer Award in Fuzzy Systems. He received the special honorary medal of the 50th Anniversary of the Polish Academy of Sciences. He received the Lifetime Outstanding Achievement Award from the International Fuzzy Systems Association. He recently received honorary doctorate degrees, honoris causa, from the State University of Information Technologies, Sofia, Bulgaria, and the Azerbaijan Technical University in Baku. Dr. Yager is a fellow of the IEEE, the New York Academy of Sciences, and the Fuzzy Systems Association. He has served at the National Science Foundation as program director in the Information Sciences program. He was a NASA/Stanford visiting fellow and a research associate at the University of California, Berkeley. He has been a lecturer at NATO Advanced Study Institutes. He is a visiting distinguished scientist at King Saud University, Riyadh, Saudi Arabia, and a distinguished honorary professor at Aalborg University, Denmark. He received his undergraduate degree from the City College of New York and his Ph.D. from the Polytechnic Institute of New York University.

Divyakant Agrawal (Qatar Computing Research Institute, Doha), [intermediate] Scalable Data Management in Enterprise and Cloud Computing Infrastructures


Summary. Data management systems and technologies have historically played a pivotal role in computing environments that involve large volumes of data and information. In fact, database management systems (DBMS) are critical components of most data-intensive application infrastructures. Furthermore, the underlying technologies, both in terms of language and query models as well as with respect to system architectures, have reached a level of maturity that has enabled their use as plug-and-play components without the need for detailed learning of their internals.

During the past decade, however, the entire area of data management, especially as it pertains to the large-scale data arising from Internet and Web-based applications, has been at a crossroads. The main question in this debate is the effectiveness of old DBMS paradigms: declarative query languages, independence of the logical and physical data models, and the computational framework based on the transaction concept. Several of the large Internet companies, such as Google, Yahoo, and Amazon, have put forth competing solutions both for building data-intensive scalable applications over the Internet/Web and for large-scale data analytics.

In the past couple of years, however, as cloud computing has become a pervasive technology and much of our data is being stored and managed in the cloud, many of these proponents of new data management technology are having a change of heart. Some of these issues have arisen due to recent events where datacenter failures have resulted in loss of user data. As a result, new architectures are emerging that rely on traditional DBMS abstractions albeit with a new twist.

During this course, we will conduct a joint exploration to gain the deeper understanding needed to participate in this debate. In particular, the following topics will be covered:

Bio. Dr. Divyakant Agrawal is a Professor of Computer Science and the Director of Engineering Computing Infrastructure at the University of California at Santa Barbara. His research expertise is in the areas of database systems, distributed computing, data warehousing, and large-scale information systems. From January 2006 through December 2007, Dr. Agrawal served as VP of Data Solutions and Advertising Systems at the Internet search company ASK.com. Dr. Agrawal has also served as a Visiting Senior Research Scientist at NEC Laboratories America in Cupertino, CA from 1997 to 2009. During his professional career, Dr. Agrawal has served on numerous program committees of international conferences, symposia, and workshops, and served as an editor of the journal Distributed and Parallel Databases (1993-2008) and the VLDB Journal (2003-2008). He currently serves as the Editor-in-Chief of Distributed and Parallel Databases and is on the editorial boards of the ACM Transactions on Database Systems and the IEEE Transactions on Knowledge and Data Engineering. He has recently been elected to the Board of Trustees of the VLDB Endowment and to the Executive Committee of the ACM Special Interest Group SIGSPATIAL. Dr. Agrawal's research philosophy is to develop data management solutions that are theoretically sound and relevant in practice. He has published more than 320 research manuscripts in prestigious forums (journals, conferences, symposia, and workshops) on a wide range of topics related to data management and distributed systems, and has advised more than 35 doctoral students during his academic career. He received the 2011 Outstanding Graduate Mentor Award from the Academic Senate at UC Santa Barbara. Dr. Agrawal was recognized as an Association for Computing Machinery (ACM) Distinguished Scientist in 2010 and was inducted as an ACM Fellow in 2012. He was also inducted as a Fellow of the IEEE in 2012.
His current interests are in the area of scalable data management and data analysis in Cloud Computing environments, security and privacy of data in the cloud, and scalable analytics over social networks data and social media.

Pierre Baldi (U California, Irvine), [intermediate] Big Data Informatics Challenges and Opportunities in the Life Sciences


Summary. This short course will explore the rich, two-way interplay between the life and computational sciences, and the informatics challenges and opportunities created by big data generated by a variety of high-throughput technologies, such as DNA sequencing, and by the Web. Data examples will be given at multiple scales, from small molecules in chemoinformatics and drug discovery, to proteins, to genomes and beyond. The Bayesian statistical framework and statistical machine learning methods will be described as key tools for addressing some of these challenges, together with integrative systems biology approaches that can combine information from multiple sources. To close the loop, we will show how biology inspires machine learning methods, such as deep architectures and deep learning. The methods will be applied to difficult problems such as predicting the structure of proteins or understanding gene regulation on a genome-wide scale. If time permits, we will briefly survey related societal issues, from genome privacy to personalized medicine.


Bio. Pierre Baldi is Chancellor's Professor in the Department of Computer Science, Director of the Institute for Genomics and Bioinformatics, and Associate Director of the Center for Machine Learning and Intelligent Systems at the University of California, Irvine. He received his Ph.D. from the California Institute of Technology. His research work is at the interface of the computational and life sciences, in particular the application of artificial intelligence and statistical machine learning methods to problems in chemoinformatics, proteomics, genomics, systems biology, and computational neuroscience. He is credited with pioneering the use of Hidden Markov Models (HMMs), graphical models, and recursive neural networks in bioinformatics. Dr. Baldi has published over 260 peer-reviewed research articles and four books and has a Google Scholar h-index of 64. He is the recipient of the 1993 Lew Allen Award, the 1999 Laurel Wilkening Faculty Innovation Award, a 2006 Microsoft Research Award, and the 2010 E. R. Caianiello Prize for research in machine learning. He is a Fellow of the Association for the Advancement of Artificial Intelligence (AAAI), the Association for Computing Machinery (ACM), and the Institute of Electrical and Electronics Engineers (IEEE).

Rajkumar Buyya (U Melbourne), [intermediate] Cloud Computing


Summary. Computing is being transformed to a model consisting of services that are commoditised and delivered in a manner similar to utilities such as water, electricity, gas, and telephony. In such a model, users access services based on their requirements without regard to where the services are hosted. Several computing paradigms have promised to deliver this utility computing vision. Cloud computing is the most recent emerging paradigm promising to turn the vision of "computing utilities" into a reality, and has become one of the buzzwords in the IT industry. Several IT vendors are promising to offer storage, computation, and application hosting services, to provide coverage in several continents, and to offer Service-Level Agreement (SLA)-backed performance and uptime promises for their services. Cloud computing delivers infrastructure, platform, and software (applications) as services, which are made available as subscription-based services in a pay-as-you-go model to consumers. The price that Cloud Service Providers charge can vary with time and with the quality of service (QoS) expectations of consumers.

This tutorial will cover (a) the 21st-century vision of computing and the various IT paradigms promising to deliver computing utilities; (b) the architecture for creating market-oriented Clouds by leveraging technologies such as VMs; (c) market-based resource management strategies that encompass both customer-driven service management and computational risk management to sustain SLA-oriented resource allocation; (d) Aneka, a software system for rapid development of Cloud applications and their deployment on private/public Clouds with resource provisioning driven by SLAs and user QoS requirements; (e) experimental results on deploying Cloud applications in engineering, gaming, and health care domains (integrating sensor networks and mobile devices) and on ISRO satellite image processing on elastic Clouds; and (f) the need for convergence of competing IT paradigms for delivering our 21st-century vision, along with pathways for future research.

Bio. Dr. Rajkumar Buyya is Professor of Computer Science and Software Engineering, Future Fellow of the Australian Research Council, and Director of the Cloud Computing and Distributed Systems (CLOUDS) Laboratory at the University of Melbourne, Australia. He is also serving as the founding CEO of Manjrasoft, a spin-off company of the University, commercializing its innovations in Cloud Computing. He has authored over 425 publications and four textbooks, including "Mastering Cloud Computing", published in 2013 by McGraw Hill and Elsevier/Morgan Kaufmann for the Indian and international markets, respectively. He also edited several books, including "Cloud Computing: Principles and Paradigms" (Wiley Press, USA, Feb 2011). He is one of the most highly cited authors in computer science and software engineering worldwide (h-index=68, g-index=141, 22000+ citations). Microsoft Academic Search ranked Dr. Buyya as the world's top author in distributed and parallel computing between 2007 and 2012. Recently, ISI identified him as a "Highly Cited Researcher" based on citations to his journal papers.

Software technologies for Grid and Cloud computing developed under Dr. Buyya's leadership have gained rapid acceptance and are in use at several academic institutions and commercial enterprises in 40 countries around the world. Dr. Buyya has led the establishment and development of key community activities, including serving as foundation Chair of the IEEE Technical Committee on Scalable Computing and five IEEE/ACM conferences. These contributions and international research leadership of Dr. Buyya are recognized through the award of "2009 IEEE Medal for Excellence in Scalable Computing" from the IEEE Computer Society, USA. Manjrasoft's Aneka Cloud technology developed under his leadership has received "2010 Asia Pacific Frost & Sullivan New Product Innovation Award" and "2011 Telstra Innovation Challenge, People's Choice Award". He is currently serving as the foundation Editor-in-Chief (EiC) of IEEE Transactions on Cloud Computing. For further information on Dr. Buyya, please visit his cyberhome.

John M. Carroll (Pennsylvania State U, University Park), [introductory] Usability Engineering and Scenario-based Design


Summary. Computers do more than just provide information and services for people to use. The design of computing systems is part of an ongoing cycle in which new technologies raise new opportunities for human activity; as people’s tasks change in response to these opportunities, new needs for technology arise. The basic argument behind scenario-based methods is that descriptions of people using technology are essential in discussing and analyzing how the technology is (or could be) reshaping their activities. A secondary advantage is that scenario descriptions can be created before a system is built and its impacts felt.

Bio. John M. Carroll is Distinguished Professor of Information Sciences and Technology at the Pennsylvania State University. His research is in methods and theory in human-computer interaction, particularly as applied to networking tools for collaborative learning and problem solving, and the design of interactive information systems. Recent books include Making Use (MIT, 2000), Usability Engineering (Morgan Kaufmann, 2002, with M.B. Rosson), Rationale-Based Software Engineering (Springer, 2008, with J. Burge, R. McCall and I. Mistrik), Learning in Communities (Springer, 2009), The Neighborhood in the Internet: Design Research Projects in Community Informatics (Routledge, 2012), and Creativity and Rationale: Enhancing Human Experience by Design (Springer, 2012). Carroll serves on several advisory and editorial boards for journals, handbooks, and series. He is editor of the Synthesis Lectures on Human-Centered Informatics. Carroll has received the Rigo Award and the CHI Lifetime Achievement Award from ACM, the Silver Core Award from IFIP, and the Goldsmith Award from IEEE. He is a fellow of AAAS, ACM, IEEE, the Human Factors and Ergonomics Society, and the Association for Psychological Science. In 2012, he received an honorary doctorate in engineering from Universidad Carlos III de Madrid.

Kwang-Ting (Tim) Cheng (U California, Santa Barbara), [introductory/intermediate] Smartphones: Hardware Platform, Software Development, and Emerging Apps


Summary. Smartphones are revolutionizing the way we communicate, work, study, socialize, play, and participate in society. As smartphones continue to evolve with more computing power and bandwidth, larger displays, and better user interfaces, new apps are emerging for each new generation of devices. One key to smartphones’ wide adoption is their ability to connect to multiple networks. With their rich connectivity, smartphones have become gateways to the cloud and bridges to the sensor swarm, in addition to maintaining robust connections to other wireless handsets. The innovations behind these amazing devices, including hardware, OS, communications, and apps, create some unique research and education opportunities. In this short course, I will introduce the key components of the hardware platform, the OS and software development platform, and emerging apps of modern smartphones and tablets. I will also discuss future trends and technical challenges in the further advancement of smartphones. What you will learn from this short course includes:

Bio. Cheng received his Ph.D. in EECS from the University of California, Berkeley in 1988. He worked at Bell Laboratories from 1988 to 1993 and joined the faculty of the University of California, Santa Barbara in 1993, where he is now acting Associate Vice Chancellor for Research and Professor in ECE. He was the founding director of UCSB’s Computer Engineering Program (1999-2002) and Chair of the ECE Department (2005-2008). He has held Visiting Professor positions at National TsingHua Univ., Taiwan (1999), the Univ. of Tokyo, Japan (2008), the Hong Kong Univ. of Science and Technology (2012), and Zhejiang University, China. His current research interests include mobile embedded systems, SoC design validation and test, and multimedia computing. He currently serves as Director of the DoD/MURI Center for 3D hybrid circuits, which aims at integrating CMOS with high-density memristors. He has published more than 350 technical papers, co-authored five books, supervised 30 PhD dissertations, and holds 12 U.S. patents in these areas.

Cheng, an IEEE Fellow, has received ten Best Paper Awards from various IEEE conferences and journals. He has also received the 2004-2005 UCSB College of Engineering Outstanding Teaching Faculty Award. He served as Editor-in-Chief of IEEE Design and Test of Computers (2006-2009) and has served on the IEEE Council on Electronic Design Automation’s Board of Governors, the IEEE Computer Society’s Publications Board, and working groups of the International Technology Roadmap for Semiconductors (ITRS). He has also served as General and Program Chair for several international conferences, including the 2012 IEEE International Test Conference.

Amr El Abbadi (U California, Santa Barbara), [introductory] The Distributed Foundations of Data Management in the Cloud


Summary. Over the past few decades, database and distributed systems researchers have made significant advances in developing protocols and techniques that carefully balance three major requirements when managing critical data: high availability, fault-tolerance, and data consistency. Over the past few years, however, Internet-scale enterprises that provide services to millions of users have faced unprecedented demands for data availability and system scalability. Cloud computing has emerged as an extremely successful paradigm for deploying Internet and Web-based applications. Scalability, elasticity, pay-per-use pricing, and autonomic control of large-scale operations are the major reasons for the successful widespread adoption of cloud infrastructures. In this talk, we will first discuss some of the critical distributed systems and database protocols that are essential for understanding current large-scale data management. We analyze the design choices that allowed modern NoSQL data management systems (key-value stores) to achieve orders-of-magnitude higher scalability than traditional databases. Although key-value stores provide high availability, they are not ideal for applications that require consistent views of the data. More recently, there has been a gradual shift back to transactions with strong consistency, for example in Google's Megastore and Spanner. We will analyze this need and the revival of SQL database management systems in large cloud settings, and we will discuss several state-of-the-art cloud-based data management systems that provide transactional guarantees on collections of data items, thus supporting complex SQL, while attempting to ensure efficiency and scalability. Of particular interest are applications that require geo-replicated data management for fault-tolerance. We will therefore explore the data management design space for geo-replicated data and discuss different approaches for replicating data in multi-datacenter environments.


  1. Distributed Systems
    • Causality and logical clocks
    • Global States, Replicated Logs and Elections
    • Consensus, broadcasts and Paxos
    • Peer to Peer Systems
  2. Data Management
    • Transactions and Serializability Theory
    • Concurrency Control
    • Distributed Commitment
  3. Key-Value Stores and the Cloud
    • Bigtable and The Google File System
    • PNUTS
    • Dynamo and Cassandra
  4. Large Scale Consistent Cloud Data Management
    • Co-located transactional support
    • Elasticity and live migration
    • Geo-replication across Data Centers
  5. Other State-of-the-Art Topics
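The flavor of topic 1 can be conveyed with a short sketch of Lamport logical clocks, the classic mechanism for tracking causality without synchronized physical clocks. This is a standard textbook construction, not material taken from the course itself:

```python
# Lamport logical clocks: each process keeps a counter, ticks it on every
# local event and send, and on receipt merges with the message's timestamp.
class Process:
    def __init__(self, name):
        self.name = name
        self.clock = 0

    def local_event(self):
        self.clock += 1
        return self.clock

    def send(self):
        self.clock += 1
        return self.clock  # timestamp piggybacked on the message

    def receive(self, msg_timestamp):
        # Merge rule: take the max of local and received clocks, then tick,
        # so causally later events always carry larger timestamps.
        self.clock = max(self.clock, msg_timestamp) + 1
        return self.clock

p, q = Process("p"), Process("q")
p.local_event()          # p: 1
t = p.send()             # p: 2, message carries timestamp 2
q.local_event()          # q: 1
q.receive(t)             # q: max(1, 2) + 1 = 3
print(p.clock, q.clock)  # 2 3
```

The merge-then-tick rule guarantees that if event a causally precedes event b, then a's timestamp is strictly smaller, which is the starting point for the replicated logs and consensus protocols listed above.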

References. A state-of-the-art collection of papers.

Pre-requisites. Basic knowledge of operating systems and a solid knowledge of data structures and algorithms.

Bio. Amr El Abbadi is a Professor in the Computer Science Department at the University of California, Santa Barbara. He received his B.Eng. from Alexandria University, Egypt, and his Ph.D. from Cornell University. Prof. El Abbadi is an ACM Fellow, an AAAS Fellow, and an IEEE Fellow, and was Chair of the Computer Science Department at UCSB from 2007 to 2011. He has served as an editor for several database journals, including, currently, The VLDB Journal, The Computer Journal, and IEEE Transactions on Computers. He has been Program Chair for multiple database and distributed systems conferences, most recently ACM SIGSPATIAL GIS 2010, the ACM Symposium on Cloud Computing (SoCC) 2011, the International Conference on Management of Data (COMAD), India, in 2012, and the ACM Conference on Online Social Networks (COSN) 2013. He served on the board of the VLDB Endowment from 2002 to 2008 and is currently a member of the Executive Committee of the Technical Committee on Data Engineering (TCDE). In 2007, Prof. El Abbadi received the UCSB Senate Outstanding Mentorship Award for his excellence in mentoring graduate students. He has published over 300 articles on databases and distributed systems.

Richard M. Fujimoto (Georgia Tech, Atlanta), [introductory] Parallel and Distributed Simulation


Summary. This course will give an overview of the parallel and distributed simulation field, from its roots in the late 1970s and 1980s through applications and current research topics today. The emphasis is on discrete event simulations, which are used in a variety of applications such as modeling computer communication networks, manufacturing systems, supply chains, transportation, and other urban infrastructures, to mention a few. The course will also cover topics related to distributed virtual environments used for entertainment and training. It focuses on issues arising in the execution of discrete event simulation programs on parallel computers ranging from multicore desktop machines to supercomputers, as well as workstations connected through local and wide area networks. This series of lectures will cover the fundamental principles and underlying algorithms used in parallel and distributed simulation systems, as well as applications and recent advances. Distributed simulation standards such as the High Level Architecture (HLA), which enable separately developed simulations to interoperate, will be discussed, and the technical underpinnings and algorithms required to implement distributed simulation systems using the HLA will be presented. A general computing background is assumed, but no prior knowledge of simulation is required.


Bio. Dr. Richard Fujimoto is Regents’ Professor and has been founding Chair of the School of Computational Science and Engineering at the Georgia Institute of Technology since 2005. He is currently the Interim Director of the Institute for Data and High Performance Computing at Georgia Tech. He received M.S. and Ph.D. degrees in Computer Science and Electrical Engineering from the University of California, Berkeley in 1980 and 1983, and B.S. degrees in Computer Science and Computer Engineering from the University of Illinois, Urbana in 1977 and 1978. He has been an active researcher in the parallel and distributed simulation community since 1985 and led the definition of the time management services for the High Level Architecture (IEEE Standard 1516). Fujimoto has served as Co-Editor-in-Chief of the journal Simulation: Transactions of the Society for Modeling and Simulation International, as well as a founding area editor for the journal ACM Transactions on Modeling and Computer Simulation. He has served on the organizing committees of several leading conferences in the parallel and distributed simulation field.

Mark Guzdial (Georgia Tech, Atlanta), [introductory] Computing Education Research: What We Know about Learning and Teaching Computer Science


Summary. Researchers have studied how people learn computer science since the 1960s. This course surveys the literature in computing education. Topics include theories of learning that have influenced computing education approaches (from behaviorism to constructionism), curricular approaches (including SICP, STREAM, HTDP, and Media Computation), analyses of student misconceptions about computer science, programming languages and environments designed to facilitate student learning, and teaching approaches.


  1. How People Learn: Constructivism, Transfer, and Novice-Expert Differences
  2. A Brief History of Computer Science Education
    • Logo and Seymour Papert
    • Boxer and Andrea diSessa
    • Smalltalk and Alan Kay
    • Scheme, SICP, and HTDP
    • Boot-Scaffs and Sally Fincher
    • Commonsense Computing
    • Graphical programming languages including Alice and Scratch
  3. Student Misconceptions about Computer Notational Machines
  4. Approaches to Teaching Computer Science
    • Peer Instruction
    • Pair Programming
    • Media Computation and Worked Examples
    • Gesture, Kinesthetic Learning Activities, and CS Unplugged

Bio. Mark Guzdial is a Professor in the School of Interactive Computing in the College of Computing at Georgia Institute of Technology. His research focuses on learning sciences and technology, specifically, computing education research. He has published several books on the use of media as a context for learning computing. He was the original developer of the "Swiki," the first wiki designed for educational use. He was awarded a joint Ph.D. degree in Education and Computer Science from the University of Michigan in 1993. He serves on the ACM's Education Council, and is on the editorial boards of the "Journal of the Learning Sciences," "ACM Transactions on Computing Education," and "Communications of the ACM." With his wife and colleague, Barbara Ericson, he received the 2010 ACM Karl V. Karlstrom Outstanding Educator Award. He was also the recipient of the 2012 IEEE Computer Society Undergraduate Teaching Award.

David S. Johnson (Columbia U, New York), [introductory] The Traveling Salesman Problem in Theory and Practice


Summary. In the Traveling Salesman Problem (TSP), we are given a set of cities and the distances between them, and are asked for the shortest tour that visits all cities and returns to its starting point. This famous optimization problem is NP-hard, and is often used to illustrate the concepts of NP-hardness and of exponential time. This can be misleading, however, since the TSP is one of the most well-solved of all the NP-hard problems. Typically, real-world instances with as many as 1000 cities can be solved optimally in hours (or minutes), and one can get within 1-2% of optimal in reasonable time for instances with up to a million cities.
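To make the problem statement concrete, here is a brute-force exact solver on a hypothetical four-city instance (the distance matrix is invented for illustration). Its factorial running time is precisely why the techniques surveyed in this course matter:

```python
# Brute-force exact TSP on a toy instance: enumerate all tours and keep
# the shortest.  Feasible only for a handful of cities.
from itertools import permutations

# Hypothetical symmetric distance matrix for 4 cities.
dist = [
    [0, 2, 9, 10],
    [2, 0, 6, 4],
    [9, 6, 0, 3],
    [10, 4, 3, 0],
]

def tour_length(tour):
    # Sum edge lengths, including the closing edge back to the start.
    return sum(dist[a][b] for a, b in zip(tour, tour[1:] + tour[:1]))

# Fix city 0 as the start so rotations of the same tour aren't re-counted.
best = min((list((0,) + p) for p in permutations(range(1, 4))),
           key=tour_length)
print(best, tour_length(best))  # [0, 1, 3, 2] 18
```

With n cities this enumerates (n-1)! tours, so already at 20 cities it is hopeless; the branch-and-cut and local search methods below are what make 1000-city and million-city instances tractable.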

In this course we shall consider the various approaches, both theoretical and practical, to solving the TSP, both exactly and approximately. We will cover results both classical and new. Almost every combinatorial optimization technique known has been tried on the TSP, so the seminar should provide a useful case study for people interested more generally in combinatorial optimization.

Among the topics likely to be included are

  1. Approximation algorithms (and schemes) for which we can prove worst-case performance guarantees or expected performance bounds.
  2. Local search heuristics (2-Opt, 3-Opt, Lin-Kernighan, etc.) that perform well in practice, and the methods used to make them run quickly.
  3. Good and bad methodologies for testing performance of heuristics.
  4. Performance of simulated annealing, genetic algorithms, and other metaheuristics on the TSP.
  5. Polynomial-time solvable special cases.
  6. Polynomial-time computable lower bounds on the optimal tour length.
  7. The principles of "branch-and-cut" algorithms for finding optimal tours (and proving optimality) and how they are used in the "Concorde" software package.
  8. "Proofs" that the TSP can be solved in polynomial time, and the obstacles they face.
  9. The asymmetric TSP and other variants, such as the minimum-latency TSP, the Prize-Collecting TSP, etc.
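As an illustration of topic 2, a bare-bones 2-Opt local search can be sketched as follows. The data is random and purely illustrative; practical implementations add the neighbor-list and other speed-up tricks alluded to above:

```python
# 2-Opt local search: repeatedly reverse a tour segment whenever doing so
# shortens the tour, until no such move helps (a local optimum).
import random

def tour_length(tour, dist):
    return sum(dist[a][b] for a, b in zip(tour, tour[1:] + tour[:1]))

def two_opt(tour, dist):
    improved = True
    while improved:
        improved = False
        n = len(tour)
        for i in range(n - 1):
            # When i == 0, stop one edge early so the two removed edges
            # never share the city at the wrap-around position.
            for j in range(i + 2, n - (i == 0)):
                a, b = tour[i], tour[i + 1]
                c, d = tour[j], tour[(j + 1) % n]
                # Would replacing edges (a,b),(c,d) with (a,c),(b,d) help?
                if dist[a][c] + dist[b][d] < dist[a][b] + dist[c][d]:
                    tour[i + 1 : j + 1] = reversed(tour[i + 1 : j + 1])
                    improved = True
    return tour

# Demo on random points in the unit square (illustrative data only).
random.seed(0)
pts = [(random.random(), random.random()) for _ in range(30)]
dist = [[((x1 - x2) ** 2 + (y1 - y2) ** 2) ** 0.5 for x2, y2 in pts]
        for x1, y1 in pts]
before = tour_length(list(range(30)), dist)
tour = two_opt(list(range(30)), dist)
print(before, tour_length(tour, dist))  # 2-Opt never lengthens the tour
```

Each accepted move strictly decreases the tour length, so the loop terminates; the quality of the resulting local optimum, and how fast such moves can be found, are exactly the questions topics 2 and 3 address.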

This is an introductory course, and so only the easiest proofs will be presented in detail. The course will be self-contained, and directions for further reading will be presented as we go along.


Pre-requisites. An undergraduate or graduate course in algorithms and data structures, plus a knowledge of NP-completeness.

George Karypis (U Minnesota, Twin Cities), [intermediate] Programming Models/Frameworks for Parallel & Distributed Computing


Summary. This is an introductory course on parallel computing. It will provide an overview of the different types of parallel architectures, present principles for decomposing computational problems toward the goal of deriving efficient parallel algorithms, present and analyze efficient parallel algorithms for a wide range of problems (e.g., sorting, dense/sparse linear algebra, graph algorithms, and discrete optimization problems), and discuss popular parallel programming paradigms and APIs (e.g., the Message Passing Interface (MPI), OpenMP, POSIX threads, and CUDA). The course combines theory (complexity of parallel algorithms) with practical issues such as parallel architectures and parallel programming.
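As a minimal taste of the decompose-solve-combine theme, the sketch below computes a sum in four parallel chunks; Python threads stand in here for the MPI, OpenMP, or CUDA workers the course actually covers:

```python
# Decompose a problem into independent chunks, solve the chunks
# concurrently, then combine the partial results.
from concurrent.futures import ThreadPoolExecutor

data = list(range(1_000_000))
chunks = [data[i::4] for i in range(4)]          # 4-way decomposition

with ThreadPoolExecutor(max_workers=4) as pool:
    partial_sums = list(pool.map(sum, chunks))   # solve chunks in parallel

total = sum(partial_sums)                        # combine step
print(total == sum(data))  # True
```

The strided decomposition, the independence of the chunk computations, and the cheap combine step are the same ingredients analyzed formally in the course, where the interesting questions become load balance and communication cost.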


Bio. George Karypis is a Professor of Computer Science in the Department of Computer Science & Engineering at the University of Minnesota. His research interests span the areas of data mining, bioinformatics, cheminformatics, high performance computing, information retrieval, collaborative filtering, and scientific computing. His research has resulted in the development of software libraries for serial and parallel graph partitioning (METIS and ParMETIS), hypergraph partitioning (hMETIS), parallel Cholesky factorization (PSPASES), collaborative filtering-based recommendation algorithms (SUGGEST), clustering high-dimensional datasets (CLUTO), finding frequent patterns in diverse datasets (PAFI), and protein secondary structure prediction (YASSPP). He has coauthored over 210 papers on these topics and two books. In addition, he serves on the program committees of many conferences and workshops on these topics, and on the editorial boards of IEEE Transactions on Knowledge and Data Engineering, the Social Network Analysis and Mining journal, the International Journal of Data Mining and Bioinformatics, Current Proteomics, Advances in Bioinformatics, and Biomedicine and Biotechnology.

Aggelos K. Katsaggelos (Northwestern U, Evanston), [intermediate] Optimization Techniques for Sparse/Low-rank Recovery Problems in Image Processing and Machine Learning


Summary. Due to their wide applicability, sparse and low-rank models have quickly become some of the most important tools for today’s researchers in image/video processing, computer vision, machine learning, statistics, optimization, and bioinformatics. Application areas in which sparse and low-rank modelling tools have been applied span a wide range of topics in these fields including: image inpainting and compressive sensing, object/face recognition, clustering and classification, Deep Learning feature selection, Collaborative Filtering, video surveillance, and many more. In this tutorial, based on a book manuscript currently under development, we will first introduce sparse and low-rank models in the context of various applications from image and video processing and machine learning. We will then describe methods used to solve sparse and low-rank recovery problems. This coverage of optimization techniques will include: a discussion of greedy methods for sparse recovery, smooth reformulation techniques for sparse and low-rank problems, and gradient techniques.


References. The course material is based on a text under development. A number of papers from the contemporary literature on sparsity will be referenced.

Pre-requisites. Basic knowledge of Vector Calculus, Linear Algebra, and Optimization techniques.

Bio. Aggelos K. Katsaggelos received the Diploma degree in electrical and mechanical engineering from the Aristotelian University of Thessaloniki, Greece, in 1979, and the M.S. and Ph.D. degrees in Electrical Engineering from the Georgia Institute of Technology, in 1981 and 1985, respectively.

In 1985, he joined the Department of Electrical Engineering and Computer Science at Northwestern University, where he is currently a Professor and holder of the AT&T Chair (he previously held the Ameritech Chair of Information Technology). He is also the Director of the Motorola Center for Seamless Communications, a member of the Academic Staff of NorthShore University HealthSystem, and an affiliated faculty member of the Department of Linguistics, and he holds an appointment with Argonne National Laboratory.

He has published extensively in the areas of multimedia signal processing and communications (over 200 journal papers, 500 conference papers and 40 book chapters) and he is the holder of 23 international patents. He is the co-author of Rate-Distortion Based Video Compression (Kluwer, 1997), Super-Resolution for Images and Video (Claypool, 2007) and Joint Source-Channel Video Transmission (Claypool, 2007).

Among his many professional activities, Prof. Katsaggelos was Editor-in-Chief of the IEEE Signal Processing Magazine (1997-2002), a member of the Board of Governors of the IEEE Signal Processing Society (1999-2001), and a member of the Publication Board of the Proceedings of the IEEE (2003-2007). He is a Fellow of the IEEE (1998) and SPIE (2009) and the recipient of the IEEE Third Millennium Medal (2000), the IEEE Signal Processing Society Meritorious Service Award (2001), the IEEE Signal Processing Society Technical Achievement Award (2010), an IEEE Signal Processing Society Best Paper Award (2001), an IEEE ICME Paper Award (2006), an IEEE ICIP Paper Award (2007), an ISPA Paper Award (2009), and a EUSIPCO Paper Award (2013). He was a Distinguished Lecturer of the IEEE Signal Processing Society (2007-2008).

Arie E. Kaufman (U Stony Brook), [intermediate/advanced] Visualization


Summary. Visualization is the discipline that creates imagery from typically big and complex data for the purpose of exploring the data. It is the method of choice for solving complex problems in practically every discipline, from science and engineering to business and medicine. Three case studies will be presented in detail along with the corresponding visualization techniques: (1) medical visualization, (2) visual simulation of flow, and (3) immersive visualization of big data. In medical visualization the focus is on 3D virtual colonoscopy for colon cancer screening. A 3D model of the colon is automatically reconstructed from a CT scan of the patient’s abdomen. The physician then interactively navigates through the volume-rendered virtual colon to locate colonic polyps and consults a system for computer-aided detection (CAD) of polyps. In visual simulation, the focus is on the Lattice Boltzmann Method (LBM) for real-time simulation and visualization of flow. LBM has been accelerated on commodity graphics processing units (GPUs), achieving real-time performance on a single GPU or on a GPU cluster. In immersive visualization, the data effectively engulfs the user. A custom-built five-wall CAVE environment, the Immersive Cabin, is presented, along with a conformal deformation rendering method for visualizing data on partially immersive platforms. To address the challenge of big data, the Reality Deck has been constructed: a unique 1.5-billion-pixel immersive display, a tiling of 416 high-resolution panels driven by an 80-GPU cluster.

Bio. Arie E. Kaufman is a Distinguished Professor and Chairman of the Computer Science Department, Director of the Center of Visual Computing (CVC), and Chief Scientist of the Center of Excellence in Wireless and Information Technology (CEWIT) at Stony Brook University (also known as the State University of New York at Stony Brook). He has conducted research for over 40 years in computer graphics and visualization and their applications, has published more than 300 refereed papers, books, and chapters, has delivered more than 20 invited keynote/plenary talks, has been awarded/filed more than 40 patents, has been a principal/co-principal investigator on more than 100 research grants, and has a Google Scholar h-index of 65. He is a Fellow of IEEE, a Fellow of ACM, an inductee into the Long Island Technology Hall of Fame, and the recipient of the IEEE Visualization Career Award, as well as numerous other awards. He was the founding Editor-in-Chief of IEEE Transactions on Visualization and Computer Graphics (TVCG), 1995-1998. He has been the co-founder/papers co-chair of the IEEE Visualization Conferences, Volume Graphics Workshops, Eurographics/SIGGRAPH Graphics Hardware Workshops, and ACM Volume Visualization Symposia. He previously chaired and is currently a director of the IEEE CS Technical Committee on Visualization and Graphics. He received a B.S. in Math and Physics from the Hebrew University, an M.S. in Computer Science from the Weizmann Institute, and a Ph.D. in Computer Science from Ben-Gurion University, Israel. For more information: www.cs.stonybrook.edu/~ari.

Carl Lagoze (U Michigan, Ann Arbor), [introductory] Curation of Big Data


Summary. Data curation is the activity of managing data from their point of creation to ensure they are available for discovery and reuse in the future. This course will examine the foundational principles, requirements, techniques, and technologies for data curation. We will learn the concepts, language, terminology, and tools of data curation, become familiar with the current state of knowledge, and identify research issues in the field. The course will bring together knowledge from a variety of fields including computer science, information science, science and technology studies, science of team science, and data science. Given the diversity of disciplines, interests, and the nature of problems in data curation, we will talk explicitly about interdisciplinary collaboration. We will also spend time in this class considering who is responsible for which aspects of data curation, how it fits with or conflicts with established scientific practices, and how data curation fits into the larger practice of ethical conduct of research.

The motivating context for this course is what many scientists, policy makers, and funders call a new “fourth” paradigm of science. The distinguishing characteristic of this new paradigm is the availability of vast amounts of data that will supposedly expedite the scientific process, lead to new discoveries that were not possible before, and make science more transparent and accessible. Many believe that this new paradigm and the infrastructure that supports it will facilitate work across disciplinary and geographic boundaries and thereby promote new, expansive approaches to grand challenge problems such as climate change. International funding agencies have responded by requiring more openness and sharing of data that is the result of research funded by the public. However, realizing the potential of openness in science, data sharing, and reuse of data has proven more difficult to accomplish than many of its advocates assumed. The difficulties arise from the complicated nature of infrastructure for sharing, preserving, and reusing data. As we will discover throughout the course, the complexities of this infrastructure extend beyond technical and include social, cultural, political, and economic factors.

The course is divided into three main sections. In the first section, we will look at the factors motivating the increasing emphasis on management, sharing, and preservation of research data. In the second section, we will focus on individual and organizational processes and technologies that are necessary for the management, storage, preservation, discovery, and use of data. These include data repositories, database structures, and metadata that support data curation. In the third section of the course, we will look at ethical issues, field-specific and general practices with regard to ownership and access to data, and open research questions in data curation.

Syllabus. Topics covered in the course are as follows:

  1. Understanding Data and Data Curation
    • Data Curation Overview: Course Description, Objectives, and Expectations
    • Research, Scholarship and Data Curation
    • Understanding Data and Scholarly Communication
    • Data Sharing and Reuse
  2. Repositories, Databases, Tools and Practices
    • Metadata and Ontologies
    • Data Models and Data Repositories
    • Identification and Citation
    • Lifecycle and Provenance
    • Linked Data
  3. Policy, Ethical, and Research Issues
    • Data Security, Trust, Privacy and Ethical Issues
    • Intellectual Property

References. Students will read a variety of papers from the contemporary scholarly literature on data curation. Readings will be announced prior to the beginning of the course.

Bio. Carl Lagoze is an Associate Professor at the University of Michigan School of Information. He received his PhD in information science at Cornell University where he held numerous research positions and subsequently served on the faculty of the Information Science Department. The overarching theme of his research for the past two decades is interoperability of information systems, spanning the full spectrum of technical and human components that are critical to create networked information systems that really work. Although the results of this research apply across a variety of information contexts, the primary thread of his research explores information systems to support scholarship and knowledge production. The results of this research career are manifested in a number of widely used applications, protocols, and standards in areas such as metadata harvesting, ontology definition, and repository architecture.

Dinesh Manocha (U North Carolina, Chapel Hill), [introductory/intermediate] Robot Motion Planning


Syllabus. Basic motion planning concepts, configuration spaces, sampling-based methods, graph search, navigation, and applications.


Pre-requisites. Some background in geometric algorithms.

Bio. Dinesh Manocha is currently the Phi Delta Theta/Mason Distinguished Professor of Computer Science at the University of North Carolina at Chapel Hill. He received his Ph.D. in Computer Science from the University of California at Berkeley in 1992. He has received a Junior Faculty Award, an Alfred P. Sloan Fellowship, an NSF CAREER Award, an Office of Naval Research Young Investigator Award, a Honda Research Initiation Award, and the Hettleman Prize for Scholarly Achievement. Along with his students, Manocha has also received 12 best paper awards at leading conferences on graphics, geometric modeling, visualization, multimedia, and high-performance computing. He has published more than 330 papers, and some of the software systems related to collision detection, GPU-based algorithms, and geometric computing developed by his group have been downloaded by more than 100,000 users and are widely used in industry. He has supervised 21 Ph.D. dissertations and is a Fellow of the ACM, AAAS, and IEEE. He received a Distinguished Alumni Award from the Indian Institute of Technology, Delhi.

Bijan Parsia (U Manchester), [introductory] The Empirical Mindset in Computer Science


Summary. "Any field with 'science' in its name is not a science."

Most people working in the field of computer science think it is a science and, indeed, a natural science. Computer science departments are often located in the same faculty as the physical sciences (even if next to the electrical engineering department), and while there are important subfields that are primarily mathematical, wide swaths of computer science clearly require empirical backing. And yet, computer science is notoriously poor at empirical methods, though this is starting to change, especially in software engineering, human-computer interaction, and ontology engineering.

This course will provide a solid introduction to how to think like an empiricist in computer science. We will examine different models of science and the sorts of empirical methods appropriate to them. We will consider issues of experiment design, execution, and reporting, non-experimental empirical methods, uncertainty and conclusiveness, and what counts as a "contribution" to knowledge. When discussing experiments, we will examine the difference between computational and human experimentation as well as specific challenges that emerge with performance benchmarking.

By the end of the course, students will have an appreciation of the complexity of empirical methods, of how they fit into the totality of evidence for claims in computer science, and of the tools for making sense of that evidence.

Bio. Dr. Bijan Parsia is a Senior Lecturer at the School of Computer Science of the University of Manchester, UK. His primary area of research is logic-based knowledge representation, with an emphasis on ontology engineering. In addition to publishing on language features, explanation, modularity, difference, automated reasoning, and related topics, he was involved in the W3C standardization of the second version of the Web Ontology Language (OWL 2) and of SPARQL, the query language for the Semantic Web. He was a designer and developer of the Swoop ontology editor and the Pellet OWL reasoner.

Robert Sargent (Syracuse U), [introductory] Validation of Models


Summary. This course will start with a review of the different types of models, followed by clear definitions of terms such as verification and validation. We will then proceed to discuss validation of system theories and models, with special emphasis on the verification and validation of simulation models. The different approaches to deciding model validity will be described, and different paradigms that relate verification and validation to the model development process will be presented. The various validation techniques will be defined. Conceptual model validity, model verification, operational validity, and data validity will be discussed in detail. The discussion of operational validity will include the validation of models of both observable and non-observable systems, using both objective and subjective decision approaches. The treatment of statistical approaches to operational validity will cover both theoretical reference distributions, such as the t and F distributions, and the use of data generated by a simulation model as the reference distribution. Documentation of validation results and a recommended procedure for model validation will be presented.
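To illustrate the statistical side of operational validity, the sketch below compares hypothetical observed system data with hypothetical simulation output using a hand-computed Welch two-sample t statistic; both the data and the quoted critical value are for illustration only:

```python
# Operational validity check: does the simulation's mean output agree
# with data observed from the real system?  We compute a Welch two-sample
# t statistic and compare it against a t reference distribution.
from statistics import mean, variance

observed  = [101.2, 98.7, 100.5, 99.9, 102.1, 100.8]   # hypothetical system data
simulated = [100.1, 99.5, 101.0, 100.4, 99.8, 100.9]   # hypothetical model output

def welch_t(x, y):
    # statistics.variance is the sample (n-1) variance, as Welch's test needs.
    vx, vy = variance(x) / len(x), variance(y) / len(y)
    return (mean(x) - mean(y)) / (vx + vy) ** 0.5

t = welch_t(observed, simulated)
# For these sample sizes the Welch degrees of freedom are roughly 7; the
# two-sided 5% critical value is about 2.36 (from t tables).  |t| below
# that threshold means no difference in means was detected.
print(abs(t) < 2.36)  # True
```

This is the "theoretical reference distribution" route; the alternative the course also covers, using simulation-generated data as the reference distribution, replaces the tabulated critical value with an empirical one.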

Pre-requisites. An introduction to at least a couple of types of mathematical models (e.g., discrete event simulation models and analytical models) and introductory knowledge of statistics.


Bio. Dr. Robert G. Sargent is a Professor Emeritus of Syracuse University. At Syracuse, Dr. Sargent held appointments in different departments and interdisciplinary programs in the L. C. Smith College of Engineering and Computer Science and was Director of the Simulation Research Group. Professor Sargent received his education at The University of Michigan. Dr. Sargent’s current research interests include the methodology areas of modeling and discrete event simulation, model validation, and performance evaluation. Professor Sargent has made numerous research contributions in his career. He was one of the first individuals to initiate the modeling of computer systems for performance evaluation. Most of his research contributions have been in the methodology areas of simulation. Dr. Sargent is especially well known for his work in validation of simulation models. Professor Sargent has performed considerable professional service. He has received numerous awards for his various contributions.

Steffen Staab (U Koblenz), [intermediate] Programming the Semantic Web


Summary. The Semantic Web is a means for publishing and (re-)using structured data in a serendipitous manner, and it is now used in prominent applications such as schema.org by the major search engines and by media companies such as the BBC, The New York Times, and many more. In this course, I will give a survey of core concepts of the Semantic Web, including:

Most current Semantic Web applications do not make full use of the serendipity of data (re-)use, but rather follow rigid procedures for collecting Semantic Web data and working with it. We believe that Semantic Web programming must be made easier and faster, so that a new RDF data source may be quickly (re-)used. For this purpose I will explain our approach to indexing the Semantic Web, exploring it, and mapping it to common programming paradigms. At its core is our LITEQ approach, which allows RDF data structures to be interactively mapped into programming data structures; the resulting structures may be strictly typed, facilitating the quick development of correct code.
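The general idea of mapping RDF data into typed programming structures can be sketched as follows. This is a hand-rolled toy over made-up triples, not the LITEQ system itself:

```python
# Toy illustration of mapping RDF-like triples into typed Python records:
# resources of a given rdf:type become instances of a typed structure.
from typing import NamedTuple

triples = [                      # hypothetical RDF data, as (s, p, o) triples
    ("ex:alice", "rdf:type", "foaf:Person"),
    ("ex:alice", "foaf:name", "Alice"),
    ("ex:bob",   "rdf:type", "foaf:Person"),
    ("ex:bob",   "foaf:name", "Bob"),
]

class Person(NamedTuple):        # the "programming data structure"
    iri: str
    name: str

def objects(s, p):
    # All objects o such that the triple (s, p, o) is present.
    return [o for (s2, p2, o) in triples if s2 == s and p2 == p]

people = [
    Person(iri=s, name=objects(s, "foaf:name")[0])
    for (s, p, o) in triples
    if p == "rdf:type" and o == "foaf:Person"
]
print([p.name for p in people])  # ['Alice', 'Bob']
```

Once the data sits in a strictly typed structure like Person, the host language's type checker can catch mistakes (e.g. accessing a non-existent property) at development time, which is the payoff the summary describes.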

Pre-requisites. Attendees should have general programming skills and should have background knowledge of data engineering, such as is taught in an introductory course on data management.

Mike Thelwall (U Wolverhampton), [introductory] Sentiment Strength Detection for Twitter and the Social Web


Summary. This course will offer an introduction to automatic sentiment analysis in the social web. It will equip participants to apply this kind of method in appropriate contexts and to understand a range of approaches to sentiment analysis, with a focus on the SentiStrength application. The course will also discuss the particular problems associated with sentiment analysis in the social web, as well as strategies for taking advantage of features such as emoticons to improve the automatic prediction of online sentiment.
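A lexicon-based approach of the kind the course covers can be sketched as follows. The word list, emoticon list, and all scores below are illustrative stand-ins, not SentiStrength's actual lexicon; only the dual positive/negative output scale follows SentiStrength's convention.

```python
# Illustrative mini-lexicon (word -> strength) and emoticon list.
LEXICON = {'love': 4, 'great': 3, 'good': 3, 'bad': -3, 'hate': -5, 'awful': -4}
EMOTICONS = {':)': 3, ':-)': 3, ':(': -3, ':-(': -3}

def sentiment_strength(text):
    """Return (positive, negative) strengths, SentiStrength-style:
    positive on a 1..5 scale and negative on a -1..-5 scale."""
    pos, neg = 1, -1
    for raw in text.lower().split():
        # Match emoticons before stripping punctuation, so that the
        # cleanup meant for word tokens does not destroy ':('.
        token = raw if raw in EMOTICONS else raw.strip('.,!?')
        score = EMOTICONS.get(token, LEXICON.get(token, 0))
        if score > 0:
            pos = max(pos, score)
        elif score < 0:
            neg = min(neg, score)
    return pos, neg

print(sentiment_strength("I love this :)"))   # → (4, -1)
```

Reporting positive and negative strength separately, rather than one net score, reflects the observation that a single text can carry both sentiments at once.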

Syllabus. The main topics will be:


Bio. Mike Thelwall is Professor of Information Science and leader of the Statistical Cybermetrics Research Group at the University of Wolverhampton, UK and a research associate at the Oxford Internet Institute. Mike has developed software and methods for gathering and analysing web data, including sentiment analysis and content analysis for Twitter, YouTube, blogs and the general web. He has published 210 refereed journal articles, seven book chapters and two books, including Introduction to Webometrics. He is an associate editor of the Journal of the American Society for Information Science and Technology and sits on three other editorial boards.

Jeffrey D. Ullman (Stanford U), [introductory] MapReduce Algorithms


Summary. We begin with an overview of a modern programming environment, involving massive numbers of inexpensive compute nodes, connected by an inexpensive network. The programming basis for such hardware is a "distributed file system," which is designed to store data reliably on cheap, failure-prone hardware. We introduce MapReduce (Hadoop) and other programming systems that are important components of this new environment, including SQL implementations (Pig, Hive) on top of MapReduce, object stores, workflow systems, and graph-processing systems.

We then explain in detail how MapReduce works and give an important example: the join of relations. Then, we look at systems that support recursion (often called "graph-processing" systems) such as Pregel or its open-source equivalent Giraph. We look at transitive closure as an important example of nontrivial recursion and look at ways to avoid redundancy in parallel computation of the transitive closure.
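The join example can be sketched as a toy in-memory simulation of the map, shuffle, and reduce phases; function and relation names here are illustrative, not part of any MapReduce API.

```python
from collections import defaultdict

def map_join(R, S):
    """Mappers tag each tuple with its relation and key it on the
    join attribute B: R(A,B) emits (b, ('R', a)); S(B,C) emits (b, ('S', c))."""
    for a, b in R:
        yield b, ('R', a)
    for b, c in S:
        yield b, ('S', c)

def reduce_join(grouped):
    """Each reducer sees all tuples sharing one B value and pairs
    every R-tuple with every S-tuple."""
    for b, values in grouped.items():
        rs = [x for tag, x in values if tag == 'R']
        ss = [x for tag, x in values if tag == 'S']
        for a in rs:
            for c in ss:
                yield a, b, c

def mapreduce_join(R, S):
    grouped = defaultdict(list)
    for key, value in map_join(R, S):   # the "shuffle": group map output by key
        grouped[key].append(value)
    return sorted(reduce_join(grouped))

print(mapreduce_join([(1, 'x'), (2, 'y')], [('x', 10), ('x', 20)]))
# → [(1, 'x', 10), (1, 'x', 20)]
```

In a real cluster the grouping is done by the framework across machines; the simulation only makes the data flow of the algorithm visible.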

Next, we consider a particular problem: an optimal algorithm for computing a multiway join in a single round of MapReduce. The critical resource in this and many other MapReduce algorithms is communication, so we view the problem as minimizing communication between the mappers and reducers. We show the problem can be set up simply using Lagrangean multipliers, and in most cases can be solved simply and exactly. We look at the exact solutions for two important classes of joins: chain joins, e.g., R(A,B) JOIN S(B,C) JOIN T(C,D), and star joins. The latter are important in analytic queries, where a very large "fact table" is joined with many smaller, but still large, "dimension tables." For star joins, not only is the optimum MapReduce algorithm easy to find, but it is always better than a cascade of two-way joins. Other multiway joins may or may not be better than a cascade of two-way joins.
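As a hedged sketch of how the Lagrangean setup plays out for the chain join above (generic notation, not necessarily the talk's exact formulation):

```latex
% Chain join R(A,B) \bowtie S(B,C) \bowtie T(C,D) with k reducers.
% Give the shared attributes B and C ``shares'' b and c with bc = k:
% an R-tuple is sent to c reducers, a T-tuple to b, an S-tuple to one.
\[
  \min \; |R|\,c + |S| + |T|\,b
  \quad\text{subject to}\quad b\,c = k .
\]
% The Lagrangean conditions |R| = \lambda b and |T| = \lambda c give
% |R|\,c = |T|\,b, and combining with bc = k yields
\[
  b = \sqrt{\frac{k\,|R|}{|T|}}, \qquad
  c = \sqrt{\frac{k\,|T|}{|R|}},
\]
% for a minimum communication of 2\sqrt{k\,|R|\,|T|} + |S|.
```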

We apply the multiway-join solution to the problem of computing "skew joins," where one or a few values appear so frequently that the tuples containing these values must be handled specially, or else the ability to compute in parallel is greatly limited. Interestingly, the implementations of skew join in Pig and Hive are not optimal, and we show how to handle these frequent values optimally.
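The basic skew-handling idea can be sketched as follows: tuples of R whose join value is the heavy hitter are split across k reducers, and each matching S-tuple is replicated to all k, so no single reducer processes the entire heavy group. The names and the single-heavy-value restriction are illustrative simplifications, not the optimal algorithm from the talk.

```python
from collections import defaultdict

def skew_join(R, S, heavy, k):
    """Join R(A,B) with S(B,C) when the value `heavy` of B is very frequent."""
    grouped = defaultdict(list)
    for i, (a, b) in enumerate(R):
        slot = i % k if b == heavy else 0      # spread the heavy key over k slots
        grouped[(b, slot)].append(('R', a))
    for b, c in S:
        slots = range(k) if b == heavy else [0]
        for slot in slots:                     # replicate S's heavy tuples to every slot
            grouped[(b, slot)].append(('S', c))
    out = []
    for (b, _), vals in grouped.items():       # each (value, slot) pair is one reducer
        rs = [x for tag, x in vals if tag == 'R']
        ss = [x for tag, x in vals if tag == 'S']
        out.extend((a, b, c) for a in rs for c in ss)
    return sorted(out)
```

The trade is the classic one the talk analyzes: replication costs extra communication, but the work for the heavy value is now spread over k reducers instead of one.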

Then, we develop a theory of MapReduce algorithms that enables us to quantify the tradeoff between computation and communication. We begin with a real example of what went wrong in what appeared to be a simple application of MapReduce, involving a search for drug interactions. We show how to redesign a bad algorithm for this problem, which is, in the abstract, the "all-pairs" problem of comparing every two items in a large set. We offer a family of algorithms, each optimal for some tradeoff factor between computation and communication. Using the theory of "mapping schemas," which represents the way MapReduce algorithms are more specialized than general parallel algorithms, we are able to prove a lower bound on the "replication rate" (communication per input) as a function of "reducer size" (the largest number of inputs one reducer is allowed to receive) that exactly matches the algorithm for the all-pairs problem.
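As a hedged sketch of the counting argument behind such a bound (generic notation; the talk's exact statement may differ), consider the all-pairs problem on n inputs with reducer size q:

```latex
% A reducer that receives at most q inputs can cover at most
% \binom{q}{2} of the \binom{n}{2} pairs, so with p reducers:
\[
  p \binom{q}{2} \;\ge\; \binom{n}{2}
  \quad\Longrightarrow\quad
  p \;\ge\; \frac{n(n-1)}{q(q-1)} \;\approx\; \frac{n^2}{q^2}.
\]
% Total communication is then about p\,q \ge n^2/q inputs, so the
% replication rate (communication per input) satisfies, roughly,
\[
  r \;=\; \frac{p\,q}{n} \;\ge\; \frac{n}{q}\,,
\]
% which is matched (up to a small constant) by partitioning the inputs
% into groups of size q/2 and assigning one reducer to each pair of groups.
```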

Finally, we look at similar tradeoffs for some other problems. We are able to provide a family of algorithms and a matching lower bound for (a) The "HD1" problem of finding those pairs of input bit strings that differ in exactly one bit, and (b) Matrix multiplication. In the latter case we are able to show that the two-round MapReduce algorithm for matrix multiplication is superior to any one-round algorithm.



Pre-requisites.

  1. A course in database systems.
  2. Reasonable mathematical sophistication, as would be obtained in an upper-level undergraduate course in CS theory (e.g., logic, automata, or algorithms).

Bio. Jeff Ullman is the Stanford W. Ascherman Professor of Engineering (Emeritus) in the Department of Computer Science at Stanford and CEO of Gradiance Corp. He received the B.S. degree from Columbia University in 1963 and the PhD from Princeton in 1966. Prior to his appointment at Stanford in 1979, he was a member of the technical staff of Bell Laboratories from 1966-1969, and on the faculty of Princeton University between 1969 and 1979. From 1990-1994, he was chair of the Stanford Computer Science Department. Ullman was elected to the National Academy of Engineering in 1989, the American Academy of Arts and Sciences in 2012, and has held Guggenheim and Einstein Fellowships. He has received the Sigmod Contributions Award (1996), the ACM Karl V. Karlstrom Outstanding Educator Award (1998), the Knuth Prize (2000), the Sigmod E. F. Codd Innovations award (2006), and the IEEE von Neumann medal (2010). He is the author of 16 books, including books on database systems, compilers, automata theory, and algorithms.

Nitin Vaidya (U Illinois, Urbana-Champaign), [introductory/intermediate] Distributed Consensus: Theory and Applications


Summary. Consensus algorithms allow a set of nodes to reach agreement on a quantity of interest. For instance, a consensus algorithm may be used by a set of server replicas (implementing a fault-tolerant server) to determine the order in which to process requests from clients, or by a network of sensors to determine the average value of the samples collected by the different sensors. Research on consensus algorithms has a long history, with contributions from different research communities, including distributed computing, control systems, and social science. Consensus algorithms have also been used in practice. The exact definition of consensus varies depending on the context. In this lecture, we will discuss some fundamental results, and several distributed consensus algorithms, each of which is designed for one of the following settings:

  1. Consensus in synchronous and asynchronous systems with crash failures.
  2. Consensus in synchronous and asynchronous systems with Byzantine failures.
  3. Average consensus in presence of lossy communication links.
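As an illustration of the first setting, here is a toy simulation of the classic (f+1)-round flooding algorithm for consensus with crash failures in a synchronous system. The failure model is simplified for the sketch: crashed nodes send nothing at all, rather than possibly crashing mid-broadcast, and the min-value decision rule is one conventional choice.

```python
def flooding_consensus(initial, f, crashed=frozenset()):
    """initial: dict mapping node id -> proposed value.
    Nodes in `crashed` are faulty and never send messages."""
    known = {n: {v} for n, v in initial.items()}
    for _ in range(f + 1):                    # f+1 rounds tolerate f crashes
        msgs = {n: set(vals) for n, vals in known.items()
                if n not in crashed}          # correct nodes broadcast all they know
        for n in known:
            if n in crashed:
                continue
            for vals in msgs.values():        # synchronous round: all messages arrive
                known[n] |= vals
    # Deterministic decision rule: every correct node picks the minimum
    # value it has seen, so agreement follows from identical value sets.
    return {n: min(known[n]) for n in known if n not in crashed}

print(flooding_consensus({1: 3, 2: 1, 3: 2}, f=1))   # → {1: 1, 2: 1, 3: 1}
```

The f+1 rounds matter in the full model, where a node may crash after sending to only some of its peers in a round; running one round more than the number of possible crashes guarantees at least one clean round in which all surviving value sets equalize.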

Bio. Nitin Vaidya received the Ph.D. from the University of Massachusetts at Amherst. He is a Professor of Electrical and Computer Engineering at the University of Illinois at Urbana-Champaign. He has held visiting positions at Technicolor Paris Lab, TU-Berlin, IIT-Bombay, Microsoft Research, and Sun Microsystems, as well as a faculty position at Texas A&M University. Nitin Vaidya has co-authored papers that received awards at several conferences, including 2007 ACM MobiHoc and 1998 ACM MobiCom. He is a fellow of the IEEE. He has served as Editor-in-Chief for the IEEE Transactions on Mobile Computing, and Editor-in-Chief for ACM SIGMOBILE publication MC2R. For more information, please visit http://www.crhc.illinois.edu/wireless/.

Philip Wadler (U Edinburgh), [intermediate] Topics in Lambda Calculus and Life


Summary. Three talks will cover a range of topics:


Bio. Philip Wadler likes to introduce theory into practice, and practice into theory. An example of theory into practice: GJ, the basis for Java with generics, derives from quantifiers in second-order logic. An example of practice into theory: Featherweight Java specifies the core of Java in less than one page of rules. He is a principal designer of the Haskell programming language, contributing to its two main innovations, type classes and monads.

Wadler is Professor of Theoretical Computer Science at the University of Edinburgh. He is an ACM Fellow and a Fellow of the Royal Society of Edinburgh, served as Chair of ACM SIGPLAN, and is a past holder of a Royal Society-Wolfson Research Merit Fellowship. Previously, he worked or studied at Stanford, Xerox PARC, CMU, Oxford, Chalmers, Glasgow, Bell Labs, and Avaya Labs, and visited as a guest professor in Copenhagen, Sydney, and Paris. He has an h-index of 59, with more than 17,500 citations to his work according to Google Scholar. He is a winner of the POPL Most Influential Paper Award, has contributed to the designs of Haskell, Java, and XQuery, and is a co-author of Introduction to Functional Programming (Prentice Hall, 1988), XQuery from the Experts (Addison Wesley, 2004) and Generics and Collections in Java (O'Reilly, 2006). He has delivered invited talks in locations ranging from Aizu to Zurich.