On Monday, July 18, registration begins at 08:00.

International Summer School on Web Science and Technology

Bilbao, Spain, July 18-22, 2016

Course Description


Keynotes

Courses


Keynotes


Ricardo Baeza-Yates   
Part-time Professor at DTIC of the Universitat Pompeu Fabra, Barcelona, Spain
Data and Algorithmic Bias in the Web

Summary:

The Web is the largest public big data repository that humankind has created. In this overwhelming data ocean, we need to be aware of the quality and, in particular, of the biases that exist in this data. In the Web, biases also come from redundancy and spam, as well as from algorithms that we design to improve the user experience. This problem is further exacerbated by biases that are added by these algorithms, especially in the context of search and recommendation systems. They include selection and presentation bias in many forms, interaction bias, social bias, etc. We give several examples and their relation to sparsity and privacy, stressing the importance of the user context to avoid these biases.
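
To make one of these notions concrete, here is a minimal sketch (an illustration, not material from the talk) of one simple bias measurement, the difference in positive-outcome rates between two user groups; the toy click data is hypothetical:

    # Statistical parity difference between two groups (toy example; see
    # Dwork et al. 2012, cited below, for a principled treatment of fairness).
    def statistical_parity_difference(outcomes_a, outcomes_b):
        rate_a = sum(outcomes_a) / len(outcomes_a)
        rate_b = sum(outcomes_b) / len(outcomes_b)
        return rate_a - rate_b

    # Hypothetical clicks on a recommended item for two user groups.
    group_a = [1, 0, 1, 1, 0, 1]  # 4/6 positive
    group_b = [0, 0, 1, 0, 0, 1]  # 2/6 positive

    print(statistical_parity_difference(group_a, group_b))  # ~0.33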

References:

C. Dwork, M. Hardt, T. Pitassi, O. Reingold and R. S. Zemel. Fairness through awareness. In ITCS 2012, pp. 214-226. ACM, 2012.

D. F. Gordon and M. desJardins. Evaluation and selection of biases in machine learning. Machine Learning, 20(1-2):5-22, 1995.

S. Ruggieri, D. Pedreschi, and F. Turini. Data mining for discrimination discovery. In Transactions on Knowledge Discovery from Data (TKDD), 4(2), 2010.

US Government report, Big Data: A Report on Algorithmic Systems, Opportunity, and Civil Rights, https://www.whitehouse.gov/sites/default/files/microsites/ostp/2016_0504_data_discrimination.pdf, 2016

Workshop on Fairness, Accountability and Transparency in Machine Learning, http://www.fatml.org/, 2015

Pre-requisites:

Basic knowledge of data management

Short bio:

Ricardo Baeza-Yates' areas of expertise are information retrieval, web search and data mining, data science and algorithms. He was VP of Research at Yahoo Labs, based in Barcelona, Spain, and later in Sunnyvale, California, from January 2006 to February 2016. He is a part-time Professor at DTIC of the Universitat Pompeu Fabra in Barcelona, Spain, as well as at DCC of the Universidad de Chile in Santiago. Until 2004 he was Professor and founding director of the Center for Web Research at the latter institution. He obtained a Ph.D. in CS from the University of Waterloo, Canada, in 1989. He is co-author of the best-selling textbook Modern Information Retrieval, published by Addison-Wesley in 2011 (2nd ed.), which won the ASIST 2012 Book of the Year award. From 2002 to 2004 he was elected to the board of governors of the IEEE Computer Society, and in 2012 he was elected to the ACM Council. Since 2010 he has been a founding member of the Chilean Academy of Engineering. In 2009 he was named ACM Fellow and in 2011 IEEE Fellow, among other awards and distinctions.


Jiawei Han   
Abel Bliss Professor of Computer Science, University of Illinois at Urbana-Champaign
From Data to Knowledge: A Data-to-Network-to-Knowledge (D2N2K) Paradigm

Summary:

Real-world big data are largely unstructured but interconnected, mainly in the form of natural language text. One of the grand challenges is to turn such massive data into actionable knowledge. In order to turn such massive unstructured, text-rich, but interconnected data into knowledge, we propose a D2N2K (i.e., data-to-network-to-knowledge) paradigm: first turn data into relatively structured heterogeneous information networks, and then mine such text-rich and structure-rich heterogeneous networks to generate useful knowledge. We show why such a paradigm represents a promising direction and present some recent progress on the development of effective methods for the construction and mining of structured heterogeneous information networks from text data. We argue that network science is the key to turning massive unstructured data into structured knowledge.
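
As a rough illustration of the first step of this paradigm (a sketch using the networkx library; the entities and relations are hypothetical, not material from the talk), extracted entities and relations become typed nodes and links that can then be mined:

    import networkx as nx

    # Typed nodes (papers, authors, venues) and typed links extracted
    # from text form a tiny heterogeneous information network.
    G = nx.Graph()
    G.add_node("p1", ntype="paper")
    G.add_node("a1", ntype="author", name="J. Han")
    G.add_node("v1", ntype="venue", name="VLDB")
    G.add_edge("a1", "p1", etype="writes")
    G.add_edge("p1", "v1", etype="published_in")

    # Mining the typed structure: venues reachable from an author
    # through a paper (the meta-path author -> paper -> venue).
    for paper in (n for n in G["a1"] if G.nodes[n]["ntype"] == "paper"):
        for nbr in G[paper]:
            if G.nodes[nbr]["ntype"] == "venue":
                print(G.nodes["a1"]["name"], "publishes in",
                      G.nodes[nbr]["name"])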

References:

A. El-Kishky, Y. Song, C. Wang, C. R. Voss, and J. Han. Scalable Topical Phrase Mining from Text Corpora. VLDB’15

J. Liu, J. Shang, C. Wang, X. Ren, J. Han, Mining Quality Phrases from Massive Text Corpora. SIGMOD’15

J. Liu, X. Ren, J. Shang, T. Cassidy, C. Voss and J. Han, Representing Documents via Latent Keyphrase Inference. WWW'16

Y. Sun and J. Han, Mining Heterogeneous Information Networks: Principles and Methodologies, Morgan & Claypool Publishers, 2012

X. Ren, A. El-Kishky, C. Wang, F. Tao, C. R. Voss, H. Ji and J. Han. ClusType: Effective Entity Recognition and Typing by Relation Phrase-Based Clustering. KDD’15.

C. Wang and J. Han, Mining Latent Entity Structures, Morgan & Claypool Publishers 2015

Pre-requisites:

Basic knowledge about network science and data mining.

Short bio:

Jiawei Han is the Abel Bliss Professor of Computer Science at the University of Illinois at Urbana-Champaign. He has been conducting research in data mining, information network analysis, database systems, and data warehousing, with over 700 journal and conference publications. He has chaired or served on many program committees of international conferences, including as PC co-chair for the KDD, SDM, and ICDM conferences, and as Americas Coordinator for the VLDB conferences. He also served as the founding Editor-in-Chief of ACM Transactions on Knowledge Discovery from Data, and is serving as the Director of the Information Network Academic Research Center supported by the U.S. Army Research Lab, and as Director of KnowEnG, a BD2K (Big Data to Knowledge) center supported by NIH. He is a Fellow of the ACM and a Fellow of the IEEE. He received the 2004 ACM SIGKDD Innovations Award, the 2005 IEEE Computer Society Technical Achievement Award, the 2009 IEEE Computer Society Wallace McDowell Award, and the 2011 Daniel C. Drucker Eminent Faculty Award at UIUC. His book "Data Mining: Concepts and Techniques" is widely used as a textbook worldwide.


Prabhakar Raghavan   
Vice President, Google Apps
Three Vignettes from the Theory and Practice of Large Data Analysis

Summary:

In this lecture we review three analytical results on recommendation systems, as well as experiments on the market behavior of such systems. We discuss connections between recommendations and personalized PageRank, as well as ideas from game theory.
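
As a hedged illustration of the personalized-PageRank connection (a sketch, not code from the lecture; the user-item graph is a toy assumption):

    import networkx as nx

    # Toy user-item interaction graph.
    G = nx.Graph()
    G.add_edges_from([
        ("alice", "item1"), ("alice", "item2"),
        ("bob", "item2"), ("bob", "item3"),
        ("carol", "item3"), ("carol", "item4"),
    ])

    # Personalized PageRank: restart the random walk at the target user;
    # high-scoring items the user has not interacted with yet become
    # recommendation candidates.
    scores = nx.pagerank(G, alpha=0.85, personalization={"alice": 1.0})
    candidates = {n: s for n, s in scores.items()
                  if n.startswith("item") and n not in G["alice"]}
    print(max(candidates, key=candidates.get))  # expected: item3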

Short bio:

Prabhakar Raghavan is Vice President of Google Apps, with responsibilities including Gmail, Google Docs, Drive, and Calendar. Raghavan's research interests include text and web mining and algorithm design. He is a former consulting professor of computer science at Stanford University and a former editor-in-chief of the Journal of the ACM. He has co-authored two textbooks, on randomized algorithms and on information retrieval. Raghavan received his PhD from the University of California, Berkeley, and is a member of the US National Academy of Engineering and a fellow of the ACM and of the IEEE; he has also been awarded a Laurea ad honorem by the University of Bologna. Prior to joining Google, he held positions as Head of Yahoo! Labs, as chief technology officer at Verity, and at IBM Research.


Amit Sheth   
LexisNexis Ohio Eminent Scholar, Wright State University, USA
Semantic, Cognitive and Perceptual Computing – three intertwined strands of a golden braid of intelligent computing

Summary:

While Bill Gates, Stephen Hawking, Elon Musk, Peter Thiel and others engaged in OpenAI debate whether or not AI, robots, and machines will replace humans, proponents of human-centric computing continue to extend work in which humans and machines partner in contextualized and personalized processing of multimodal data to derive actionable information. In this talk, we discuss how maturing paradigms such as semantic computing (SC) and cognitive computing (CC), complemented by the emerging perceptual computing (PC) paradigm, provide a continuum through which to exploit the ever-increasing volume and diversity of data that could enhance people’s daily lives. SC and CC sift through raw data to personalize it according to context and individual user, creating abstractions that move the data closer to what humans can readily understand and apply in decision-making. PC, which interacts with the surrounding environment to collect data that is relevant and useful in understanding the outside world, is characterized by interpretative and exploratory activities supported by the use of prior/background knowledge. Using the examples of personalized digital health and the smart city, we will demonstrate how SC, CC and PC form complementary capabilities that will enable the development of the next generation of intelligent systems.

References:

Amit Sheth, "Computing for Human Experience: Semantics-Empowered Sensors, Services, Social Computing on the Ubiquitous Web," IEEE Internet Computing, 14 (1), January/February 2010.

Amit Sheth, Pramod Anantharam, Cory Henson, Semantic, Cognitive, and Perceptual Computing: Advances toward Computing for Human Experience, to appear in IEEE Computer. Preprint: http://arxiv.org/abs/1510.05963

Amit Sheth, Internet of Things to Smart IoT Through Semantic, Cognitive, and Perceptual Computing, IEEE Intelligent Systems, March/April 2016.

Pre-requisites:

None, but an introductory background in and interest in the Semantic Web or AI would be helpful.

Short bio:

Prof. Amit Sheth is an educator, researcher and entrepreneur. He is the LexisNexis Ohio Eminent Scholar, an IEEE Fellow, and the executive director of Kno.e.sis, the Ohio Center of Excellence in Knowledge-enabled Computing at Wright State University. Kno.e.sis has ~80 researchers, including 15 faculty and over 60 funded students. In World Wide Web (WWW) research, it is placed among the top ten universities in the world based on 10-year impact. He has founded two companies and continues to advise/direct startups in semantics and healthcare; several commercial products and deployed systems have resulted from his research. His former students are exceptionally successful as academics in research universities, researchers in industry, and entrepreneurs; average citations for his first 18 past PhD students exceed 1,425. See: http://j.mp/Kimpact

Courses


Tim Baldwin   
Professor in the Department of Computing and Information Systems, The University of Melbourne.
Social Media and Text Analytics

Summary:

This course will introduce students to natural language processing (NLP) in the context of social media. It will cover text preprocessing tasks (incl. language identification, lexical normalisation and named entity recognition), user metadata prediction tasks (incl. user geolocation and demographic variables), and NLP enhancements to end-user applications (incl. search over web user forums, real-time duplicate question detection, and trend analysis over text streams).
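
As a small illustration of the first preprocessing task listed above, language identification can be treated as classification over character n-grams; the following sketch (with toy training data and scikit-learn; real systems train on large multilingual corpora) shows the idea:

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.pipeline import make_pipeline

    # Toy training data; labels are ISO language codes.
    train_texts = ["this is an english tweet",
                   "the weather is nice today",
                   "ceci est un tweet en français",
                   "il fait beau aujourd'hui"]
    train_langs = ["en", "en", "fr", "fr"]

    # Character n-grams are robust to the noisy spelling of social media.
    clf = make_pipeline(
        CountVectorizer(analyzer="char", ngram_range=(1, 3)),
        MultinomialNB(),
    )
    clf.fit(train_texts, train_langs)
    print(clf.predict(["quel beau temps"]))  # expected: ['fr']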

Syllabus:

  1. Introduction to social media and natural language processing for social media
  2. Preprocessing of social media text
    • Language identification
    • Lexical normalisation
    • POS tagging and named entity recognition
  3. User metadata prediction
    • User geolocation
    • Prediction of demographic variables
  4. NLP enhancements to social media end-user applications
    • Search over web user forums
    • Real-time duplicate question detection
    • Trend analysis over text streams
  5. Restrictions and ethics of social media usage
  6. Future directions for NLP over social media sources

References:

To be provided in course slides

Pre-requisites:

Basic knowledge of natural language processing and machine learning.

Short bio:

Tim Baldwin is a Professor in the Department of Computing and Information Systems, The University of Melbourne, and an Australian Research Council Future Fellow. He has previously held visiting positions at Cambridge University, University of Washington, University of Tokyo, Saarland University, NTT Communication Science Laboratories, and National Institute of Informatics. His research interests include text mining of social media, computational lexical semantics, information extraction and web mining, with a particular interest in the interface between computational and theoretical linguistics. Current projects include web user forum mining, monitoring and text mining of Twitter, and text analytics for the creative industries.
Tim completed a BSc(CS/Maths) and BA(Linguistics/Japanese) at The University of Melbourne in 1995, and an MEng(CS) and PhD(CS) at the Tokyo Institute of Technology in 1998 and 2001, respectively. Prior to joining The University of Melbourne in 2004, he was a Senior Research Engineer at the Center for the Study of Language and Information, Stanford University (2001-2004).


Vassilis Christophides   
Professor of Computer Science at the University of Crete, Greece
Entity Resolution in the Web of Data

Summary:

Over the past decade, numerous knowledge bases (KBs) have been built to power a new generation of Web applications that provide entity-centric search and recommendation services. These KBs offer comprehensive, machine-readable descriptions of a large variety of real-world entities (e.g., persons, places, products, events) published on the Web as Linked Data (LD). Even when derived from the same data source (e.g., a Wikipedia entry), KBs such as DBpedia, YAGO2, or Freebase may provide multiple, non-identical descriptions of the same real-world entities. This is due to the different information extraction tools and curation policies employed by KBs, resulting in complementary and sometimes conflicting entity descriptions. Entity resolution (ER) aims to identify different descriptions that refer to the same real-world entity, and it emerges as a central data-processing task for an entity-centric organization of Web data. ER is needed to enrich the interlinking of data elements describing entities, even by third parties, so that the Web of data can be accessed by machines as a global data space using standard languages such as SPARQL. ER can also facilitate automated KB construction by integrating entity descriptions from legacy KBs with Web content published as HTML documents.
ER has attracted significant attention from researchers in the information systems, database, and machine-learning communities. The objective of this lecture is to present the new ER challenges stemming from the Web's openness in describing, through an unbounded number of KBs, a multitude of entity types across domains, as well as from the high heterogeneity (semantic and structural) of descriptions, even for the same types of entities. The scale, diversity and graph structuring of entity descriptions published according to the LD paradigm challenge the core ER tasks, namely, (i) how descriptions can be effectively compared for similarity and (ii) how resolution algorithms can efficiently filter the candidate pairs of descriptions that need to be compared.
In multi-type and large-scale entity resolution, we need to examine whether two entity descriptions are somehow (or nearly) similar without resorting to domain-specific similarity functions and/or mapping rules. Furthermore, the resolution of some entity descriptions might influence the resolution of other, neighbouring descriptions. This setting clearly goes beyond deduplication (or record linkage) of collections of descriptions usually referring to a single entity type that differ only slightly in their attribute values. It essentially requires leveraging the similarity of descriptions both in their content and in their structure. It also forces us to revisit traditional ER workflows consisting of separate indexing (for pruning the number of candidate pairs) and matching (for resolving entity descriptions) phases.
In this talk we intend to provide a concise overview for researchers, students and developers who are interested in a global view of the ER problem in the Web of data.
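
As a concrete taste of the blocking techniques listed in the syllabus below, here is a minimal token-blocking sketch (an illustration; the entity descriptions are hypothetical): descriptions sharing any token land in the same block, and only within-block pairs are compared.

    from collections import defaultdict
    from itertools import combinations

    # Toy entity descriptions keyed by id.
    descriptions = {
        "e1": "barack obama president usa",
        "e2": "b. obama 44th us president",
        "e3": "eiffel tower paris landmark",
    }

    # Token blocking: each token defines a block of candidate entities.
    blocks = defaultdict(set)
    for eid, text in descriptions.items():
        for token in set(text.split()):
            blocks[token].add(eid)

    # Candidate pairs are those co-occurring in at least one block.
    candidates = set()
    for ids in blocks.values():
        candidates.update(combinations(sorted(ids), 2))
    print(candidates)  # {('e1', 'e2')} -- e3 shares no token with the others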

Syllabus:

  1. Describing and Linking Entities: Linked Knowledge Bases and Entity-Centric Applications
    • Introduction to Entities and Knowledge
    • Entity Graphs and the Web of Data
    • Web Knowledge Bases Construction and Curation
    • Web-scale Entity Resolution (ER) challenge
    • ER Problems and Applications
    • ER Paradigms and Workflows
  2. Matching and Resolving Entities (I): Entity Similarity and Blocking Techniques
    • Entity Similarity Functions
    • Content-based
    • Structure-based
    • Approximative Techniques
    • Blocking Frameworks for Web-scale ER
    • Token-based
    • Attribute-based
    • URL-based
    • Block Post-processing
    • Critical Assessment of Blocking Techniques
  3. Matching and Resolving Entities (II): Iterative and Progressive Resolution Techniques
    • Graph-based Iterative Algorithms
    • Relational Entities
    • Linked Entities
    • Graph-based Progressive Algorithms
    • Open Challenges and Conclusions

References:

Vassilis Christophides, Vasilis Efthymiou, Kostas Stefanidis: Entity Resolution in the Web of Data. Synthesis Lectures on the Semantic Web: Theory and Technology, Morgan & Claypool Publishers 2015 (http://www.morganclaypoolpublishers.com/catalog_Orig/product_info.php?products_id=823).

Short bio:

Vassilis Christophides is Professor of Computer Science at the University of Crete. He has recently been appointed to an advanced research position at INRIA Paris. Previously, he worked as a Distinguished Scientist at Technicolor's R&I Center in Paris. He studied Electrical Engineering at the National Technical University of Athens (NTUA), Greece, graduating in July 1988; he received his DEA in computer science from the University of Paris VI in June 1992, and his Ph.D. from the Conservatoire National des Arts et Métiers (CNAM) in Paris in October 1996. He has published over 120 articles in high-quality international conferences, journals and workshops. He has been scientific coordinator of a number of research projects funded by the European Union, the Greek State and private foundations on the Semantic Web and Digital Preservation at the Institute of Computer Science of FORTH. He received the 2004 SIGMOD Test of Time Award and the Best Paper Awards at the 2nd and 6th International Semantic Web Conferences in 2003 and 2007. He served as General Chair of the joint EDBT/ICDT Conference in 2014 in Athens and as Area Chair for the ICDE "Semi-structured, Web, and Linked Data Management" track in 2016 in Bali, Indonesia.


Brian D. Davison   
Associate Professor, Department of Computer Science and Engineering, College of Engineering and Applied Science, Lehigh University
Useful Web Mining with R

Summary:

With an open source history and extensive set of contributed libraries, R provides a flexible, powerful and popular environment for data science. In this short course, we introduce new users to R and show how R can facilitate web data collection, analysis and visualization.

Syllabus:

  1. Introduction to R programming
  2. Using Libraries in R
  3. Data Mining Libraries
  4. Mining Web Data
  5. Visualization with R

References:

Nina Zumel and John Mount. Practical Data Science with R. Manning, 2014.

Brett Lantz. Machine Learning with R. Packt Publishing, 2013

Graham Williams. R Programming and Data Science. Chapman & Hall/CRC Press, forthcoming.

Yanchang Zhao. R and Data Mining: Examples and Case Studies. Academic Press, 2012.

Pre-requisites:

Introduction to data mining techniques and some programming experience. No prior knowledge of R is required.

Short bio:

Brian D. Davison is an associate professor of computer science and engineering and teaches courses on data science, data mining, web search engines, web mining, networking, system administration, and C and UNIX programming. He heads Lehigh's Web Understanding, Modeling, and Evaluation (WUME) laboratory and serves as joint editor-in-chief of the ACM journal Transactions on the Web. While on sabbatical during the 2013-2014 academic year, he worked in the Core Data Science group at Facebook. Dr. Davison earned his B.S. from Bucknell University and his M.S. and Ph.D. in Computer Science from Rutgers University. His research includes web search and mining, focusing on search, recommendation and classification problems on the Web and social networks. He is an NSF Faculty Early CAREER award winner. Dr. Davison's research has been supported by the National Science Foundation, the Defense Advanced Research Projects Agency, Microsoft, and Sun Microsystems.


Marco Gori   
Professor of computer science at the University of Siena, Italy
Learning semantic-based structures from textual sources

Summary:

The course gives the big picture of bridging knowledge-based representational formalisms and machine learning, with the purpose of opening the door to new semantic-based methods for attacking information retrieval.

Syllabus:

  1. Lecture 1: Bridging logic and machine learning for text processing
    In this lecture, I discuss methods to bridge logic formalisms and machine learning. The purpose of the lecture is to open the mind towards breaking down the wall that separates the schools of thought driven by logic and by continuous math. After a brief survey, emphasis is given to the unified notion of constraint, which is used both in the expression of symbolic knowledge and in the formalization of learning algorithms. Examples from text processing give the big picture of the transition of information retrieval towards semantic-based access to information.
  2. Lecture 2: Semantic-based regularization
    In this lecture, I present the theory of learning from constraints as a general framework for attacking text-processing problems that involve semantics. I give some representation theorems that extend the classic framework of regularization so as to incorporate logic formalisms, such as first-order logic. This is made possible by the unification of continuous and discrete computational mechanisms in the same functional framework, so that any stimulus, such as supervised examples and logic predicates, is translated into constraints; a schematic objective is sketched after this list. Finally, it is shown how deep neural networks can be trained in this more general semantic-based framework.
  3. Lecture 3: The emergence of semantics
    In this lecture, I present experimental results of the theory on a number of text-processing problems. In addition to classic problems like text categorization, it is shown that we can carry out constraint satisfaction, so that the proposed intelligent agents can draw conclusions that involve semantics.
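
As a schematic rendering of the Lecture 2 objective (an illustration, not the course's exact formulation), learning from constraints can be written as a regularized risk with extra penalty terms for violated logic constraints:

    % Supervised loss + classic regularizer + logic-constraint penalties
    \min_{f \in \mathcal{H}} \; \sum_{i=1}^{\ell} V\bigl(y_i, f(x_i)\bigr)
      \;+\; \lambda \, \|f\|_{\mathcal{H}}^{2}
      \;+\; \sum_{j=1}^{m} \mu_j \, \phi_j(f)

Here V is a loss on the supervised examples, \|f\|_{\mathcal{H}}^{2} is the classic regularization term, and each \phi_j(f) measures the violation of the j-th logic constraint, e.g., a first-order rule translated into a real-valued functional.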

References:

Interesting resources on bridging symbolic and sub-symbolic representation of the information that involve machine learning can be found at http://www.neural-symbolic.org/

Pre-requisites:

A preliminary background on constraint programming does help, but is not necessary.

Short bio:

Marco Gori received the Ph.D. degree in 1990 from the Università di Bologna, Italy, working partly at the School of Computer Science (McGill University, Montreal). In 1992, he became an Associate Professor of Computer Science at the Università di Firenze and, in November 1995, he joined the Università di Siena, where he is currently full professor of computer science. His main interests are in machine learning with applications to pattern recognition, Web mining, and game playing. He is especially interested in bridging logic and learning and in the connections between symbolic and sub-symbolic representations of information. He was the leader of the WebCrow project for automatic solving of crosswords, which outperformed human competitors in an official competition held during the ECAI-06 conference. As a follow-up to this grand challenge he founded QuestIt, a spin-off company of the University of Siena working in the field of question answering. He is co-author of the book "Web Dragons: Inside the Myths of Search Engine Technologies," Morgan Kaufmann (Elsevier), 2006. Dr. Gori serves or has served as an Associate Editor of a number of technical journals in his areas of expertise, has received best paper awards, and has been a keynote speaker at a number of international conferences. He was the Chairman of the Italian Chapter of the IEEE Computational Intelligence Society and the President of the Italian Association for Artificial Intelligence. He is a fellow of the IEEE, ECCAI, and IAPR, and is in the list of top Italian scientists kept by the VIA-Academy (http://www.topitalianscientists.org/top_italian_scientists.aspx).


Alon Y. Halevy   
Affiliate Professor, Computer Science and Engineering; Executive Director of the Recruit Institute of Technology, San Francisco Bay Area, USA
Structured Data on the Web

Summary:

This course surveys recent work on providing answers from structured data in Web search. We will see the different forms in which structured data appears in search and the unique challenges that arise in this context. We'll dig deeper into how structured data is found on the Web (in HTML tables) and how to find good tables. We'll cover some of the principles of data integration that are relevant to Web search and will discuss some emerging topics in this new area.
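
As a quick illustration of the HTML-table theme (a sketch; the URL is a placeholder, and pandas.read_html requires an HTML parser such as lxml):

    import pandas as pd

    # Parse every <table> on a page into a DataFrame, then apply a crude
    # "good table" heuristic: enough rows and columns to be plausibly
    # relational rather than mere page layout.
    tables = pd.read_html("https://example.org/page-with-tables.html")
    good = [t for t in tables if t.shape[0] >= 5 and t.shape[1] >= 2]
    print(len(tables), "tables found,", len(good), "plausibly relational")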

Syllabus:

  1. Different forms of structured data in web search results
  2. Getting structured data out of Web content
  3. Mining high-quality HTML tables
  4. Principles of data integration
  5. Emerging topics in structured data on the Web

References:

(Optional) Principles of Data Integration (Anhai Doan, Alon Halevy and Zachary Ives, Morgan Kaufmann, 2012).

Short bio:

Alon Halevy is the Executive Director of the Recruit Institute of Technology. Prior to that, he headed the Structured Data Management Research group at Google, and before that he was a Professor of Computer Science at the University of Washington, Seattle, where he founded the database group. In 1999, Dr. Halevy co-founded Nimble Technology, one of the first companies in the Enterprise Information Integration space, and in 2004 he founded Transformic, a company that created search engines for the deep web and was acquired by Google. Dr. Halevy is a Fellow of the Association for Computing Machinery, received the Presidential Early Career Award for Scientists and Engineers (PECASE) in 2000, and was a Sloan Fellow (1999-2000). He received his Ph.D. in Computer Science from Stanford University in 1993 and his Bachelor’s from the Hebrew University of Jerusalem. Halevy is also a coffee culturalist and authored the book "The Infinite Emotions of Coffee," published in 2011; he is a co-author of the book "Principles of Data Integration," published in 2012.


Andreas Hotho   
Professor at the University of Würzburg and the head of the DMIR group.
Semantics in the Social Web

Summary:

The Social Web is a rich source of semantic information heavily influenced by its users. In recent years, research has developed methods to reveal the hidden semantic information in weakly structured sources like tagging systems or social encyclopedias like Wikipedia. The course is about understanding the emergence and extraction of semantics from the Social Web and its relationship to the Semantic Web. After an introduction to the Social and Semantic Web, we will focus on two main sources of information: tags from social bookmarking systems and Wikipedia. We will show that we can find and extract semantics in such systems. Besides the content users provide, we will investigate users' behaviour, which is another important factor. We will learn how users influence semantics and how we can extract semantics from observed behaviour. Finally, we will learn how to compare hypotheses about human behaviour on the web and study the influence of semantics on it.
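
To make the tag-semantics idea concrete, here is a minimal co-occurrence-based tag-relatedness sketch (an illustration in the spirit of the relatedness measures in the references below; the posts are hypothetical):

    from collections import Counter
    from math import sqrt

    # Each post is the set of tags a user assigned to one resource.
    posts = [
        {"python", "programming", "code"},
        {"python", "programming", "tutorial"},
        {"python", "code", "web"},
        {"semantic", "ontology", "rdf"},
        {"semantic", "ontology", "web"},
    ]

    def cooccurrence_vector(tag):
        """Count how often `tag` co-occurs with every other tag."""
        vec = Counter()
        for post in posts:
            if tag in post:
                vec.update(post - {tag})
        return vec

    def relatedness(tag_a, tag_b):
        """Cosine similarity of the two tags' co-occurrence vectors."""
        va, vb = cooccurrence_vector(tag_a), cooccurrence_vector(tag_b)
        dot = sum(va[t] * vb[t] for t in va)
        norm = (sqrt(sum(v * v for v in va.values()))
                * sqrt(sum(v * v for v in vb.values())))
        return dot / norm if norm else 0.0

    print(relatedness("python", "programming"))  # ~0.39 (related)
    print(relatedness("python", "ontology"))     # ~0.13 (less related)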

Syllabus:

  1. Introduction
    1. Web, Social Web and Semantic Web
      1. Tagging Systems and Folksonomies
      2. Wikipedia
      3. Ontologies
    2. Semantic and Social Web - Comparing Folksonomies and Ontologies
  2. Learning Semantics from the Social Web
    1. Properties of Folksonomies
      1. Understanding the Network
      2. Properties of Tags, Users, Resources Types
    2. Extracting Semantics from Folksonomies
      1. Association Rules
      2. Measures of Tag Relatedness
      3. Learning Approaches
    3. Extracting Semantics from Wikipedia
  3. User behaviour and semantics
    1. Categorizers/Describers in Folksonomies
    2. Extracting semantics from Wikipedia navigation
    3. Influence of semantics on user behaviour on the Web

References:

Benz, D.; Hotho, A.; Jäschke, R.; Krause, B.; Mitzlaff, F.; Schmitz, C. & Stumme, G. (2010), 'The Social Bookmark and Publication Management System BibSonomy', The VLDB Journal 19 (6), 849--875.
Cattuto, C.; Benz, D.; Hotho, A. & Stumme, G. (2008), Semantic Grounding of Tag Relatedness in Social Bookmarking Systems, in 'Proceedings of the 7th International Conference on The Semantic Web' , pp. 615--631.
Körner, C.; Benz, D.; Strohmaier, M.; Hotho, A. & Stumme, G. (2010), Stop Thinking, start Tagging - Tag Semantics emerge from Collaborative Verbosity, in 'Proceedings of the 19th International World Wide Web Conference (WWW 2010)' , ACM, Raleigh, NC, USA.
Niebler, T., Schlör, D., Becker, M., Hotho, A.: Extracting Semantics from Unconstrained Navigation on Wikipedia. Künstliche Intelligenz. (2015).
Singer, P.; Niebler, T.; Strohmaier, M. & Hotho, A. (2013), 'Computing Semantic Relatedness from Human Navigational Paths: A Case Study on Wikipedia', International Journal on Semantic Web and Information Systems (IJSWIS) 9 (4), 41--70.
Singer, P.; Helic, D.; Hotho, A. & Strohmaier, M. (2015), Hyptrails: A bayesian approach for comparing hypotheses about human trails, in '24th International World Wide Web Conference (WWW2015)' , ACM, Firenze, Italy.
Staab, S. & Studer, R., ed. (2004), Handbook on Ontologies, Vol. 10, Springer.

Short bio:

Andreas Hotho is professor at the University of Würzburg. He holds a Ph.D. from the University of Karlsruhe, where he worked from 1999 to 2004 at the Institute of Applied Informatics and Formal Description Methods (AIFB) in the areas of text, data, and web mining, the semantic web and information retrieval. He earned his Master’s Degree in information systems from the University of Braunschweig in 1998. From 2004 to 2009 he was a senior researcher at the University of Kassel. He joined the L3S in 2011. Since 2005 he has been leading the development of the social bookmark and publication sharing platform BibSonomy. Andreas has published over 100 articles in journals and at conferences, co-edited several special issues and books, and co-chaired several workshops. He has worked as a reviewer for journals and served on the program committees of many international conferences and workshops. His research focuses on the combination of data mining, information retrieval and the semantic web. Further, he is interested in the analysis of social media systems, in particular folksonomies, tagging, and sensor data emerging through ubiquitous and social activities. As the World Wide Web is one of his main application areas, his research contributes to the field of web science.


Jiawei Han   
Abel Bliss Professor of Computer Science, University of Illinois at Urbana-Champaign
Construction and Mining of Text-Rich Heterogeneous Information Networks

Summary:

Massive amounts of data are natural-language, text-based, unstructured, noisy and untrustworthy, but they are interconnected, potentially forming gigantic information networks. If such text-rich data can be processed and organized into multiple-typed, semi-structured heterogeneous information networks, organized knowledge can be mined from such networks. Most real-world applications that handle big data, including interconnected social networks, medical information systems, online e-commerce systems, and Web-based forum and data systems, can be structured into typed, heterogeneous social and information networks. For example, in a medical care network, objects of multiple types, such as patients, doctors, diseases and medications, and links such as visits, diagnoses and treatments, are intertwined, providing rich information and forming heterogeneous information networks. Effective analysis of large-scale, text-rich heterogeneous information networks poses an interesting but critical challenge.

In this course, we present an overview of recent studies on the construction and mining of text-rich heterogeneous information networks. We show that relatively structured heterogeneous information networks can be constructed from unstructured, interconnected text data, and that such networks bring tremendous benefits for data mining. Departing from many existing network models that view data as homogeneous graphs or networks, the text-based, semi-structured heterogeneous information network model leverages the rich semantics of typed nodes and links in a network and can uncover surprisingly rich knowledge from interconnected data. This heterogeneous network modeling will lead to the discovery of a set of new principles and methodologies for mining text-rich, interconnected data. We will also point out some promising research directions and argue that the construction and mining of text-rich heterogeneous information networks could be a key to information management and mining.
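
As a small taste of the mining side (a sketch, not course code), a meta-path-based similarity such as PathSim, covered in the Sun & Han reference below, can be computed from the adjacency matrices of a toy author-paper-venue network:

    import numpy as np

    # Author-paper and paper-venue adjacency for a tiny A-P-V schema
    # (authors a0..a2, papers p0..p3, venues v0..v1); all values toy.
    AP = np.array([[1, 1, 0, 0],    # a0 wrote p0, p1
                   [0, 1, 1, 0],    # a1 wrote p1, p2
                   [0, 0, 0, 1]])   # a2 wrote p3
    PV = np.array([[1, 0],          # p0 in v0
                   [1, 0],          # p1 in v0
                   [0, 1],          # p2 in v1
                   [0, 1]])         # p3 in v1

    # Commuting matrix for the symmetric meta-path A-P-V-P-A.
    M = AP @ PV @ PV.T @ AP.T

    def pathsim(i, j):
        """Normalized count of meta-path instances between authors i, j."""
        return 2 * M[i, j] / (M[i, i] + M[j, j])

    print(pathsim(0, 1))  # ~0.67: a0 and a1 publish in the same venue
    print(pathsim(0, 2))  # 0.0: no shared venue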

Syllabus:

  1. Why Construct and Mine Heterogeneous Information Networks?
  2. Construction of Heterogeneous Networks from Text Data
    1. Phrase Mining and Topic Modeling from Large Corpora
    2. Entity Extraction and Typing by Relational Graph Construction and Propagation
  3. Mining Heterogeneous Information Networks
  4. Data → Network → Knowledge (D2N2K): A Path for Data to Knowledge

References:

A. El-Kishky, Y. Song, C. Wang, C. R. Voss, and J. Han. Scalable Topical Phrase Mining from Text Corpora. VLDB’15

J. Liu, J. Shang, C. Wang, X. Ren, J. Han, Mining Quality Phrases from Massive Text Corpora. SIGMOD’15

J. Liu, X. Ren, J. Shang, T. Cassidy, C. Voss and J. Han, Representing Documents via Latent Keyphrase Inference. WWW'16

Y. Sun and J. Han, Mining Heterogeneous Information Networks: Principles and Methodologies, Morgan & Claypool Publishers, 2012

X. Ren, A. El-Kishky, C. Wang, F. Tao, C. R. Voss, H. Ji and J. Han. ClusType: Effective Entity Recognition and Typing by Relation Phrase-Based Clustering. KDD’15.

C. Wang and J. Han, Mining Latent Entity Structures, Morgan & Claypool Publishers 2015

Pre-requisites:

Basic knowledge about network science and data mining.

Short bio:

Jiawei Han is the Abel Bliss Professor of Computer Science at the University of Illinois at Urbana-Champaign. He has been conducting research in data mining, information network analysis, database systems, and data warehousing, with over 700 journal and conference publications. He has chaired or served on many program committees of international conferences, including as PC co-chair for the KDD, SDM, and ICDM conferences, and as Americas Coordinator for the VLDB conferences. He also served as the founding Editor-in-Chief of ACM Transactions on Knowledge Discovery from Data, and is serving as the Director of the Information Network Academic Research Center supported by the U.S. Army Research Lab, and as Director of KnowEnG, a BD2K (Big Data to Knowledge) center supported by NIH. He is a Fellow of the ACM and a Fellow of the IEEE. He received the 2004 ACM SIGKDD Innovations Award, the 2005 IEEE Computer Society Technical Achievement Award, the 2009 IEEE Computer Society Wallace McDowell Award, and the 2011 Daniel C. Drucker Eminent Faculty Award at UIUC. His book "Data Mining: Concepts and Techniques" is widely used as a textbook worldwide.


Ravi Kumar   
Google, Mountain View, CA
Computing at Scale: Models and Algorithms

Summary:

Traditional computational models are inadequate for efficiently processing large amounts of data, and one has to resort to novel computational models. In this course we will focus on MapReduce, one popular such model. We will illustrate its power by presenting MapReduce algorithms for some basic graph problems. In addition, we will explore the connections between MapReduce and other paradigms for processing massive data.
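
To fix ideas, here is a minimal in-process simulation of the MapReduce model (an illustration; real systems distribute these phases across machines):

    from collections import defaultdict
    from itertools import chain

    # Map emits key-value pairs, shuffle groups them by key, reduce
    # aggregates each group -- here, the classic word count.
    def map_phase(doc):
        return [(word, 1) for word in doc.split()]

    def shuffle(pairs):
        groups = defaultdict(list)
        for key, value in pairs:
            groups[key].append(value)
        return groups

    def reduce_phase(key, values):
        return key, sum(values)

    docs = ["big data big models", "big graphs"]
    pairs = chain.from_iterable(map_phase(d) for d in docs)
    counts = dict(reduce_phase(k, vs) for k, vs in shuffle(pairs).items())
    print(counts)  # {'big': 3, 'data': 1, 'models': 1, 'graphs': 1}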

Syllabus:

Pre-requisites:

Undergraduate algorithms.

Short bio:

Ravi Kumar has been a senior staff research scientist at Google since 2012. Prior to this, he was a research staff member at the IBM Almaden Research Center and a principal research scientist at Yahoo! Research. His research interests include Web search and data mining, algorithms for massive data, and the theory of computation.


Haewoon Kwak   
Research Scientist, Qatar Computing Research Institute, HBKU
From social network analysis to social media analytics and beyond: challenges and opportunities

Summary:

Since social media became popular, one of the fascinating opportunities for researchers has been access to underlying social networks at large scale. Such network structures make it possible to quantitatively understand the structure and dynamics of societal systems. In this course, you will learn about the important concepts and tools for understanding social media, from the very basic to the more advanced.
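
As a hint of the basic end of that spectrum, a few standard structural measures can be computed with the networkx library (a sketch; the follower edges are hypothetical):

    import networkx as nx

    # Toy follower graph: an edge u -> v means u follows v.
    G = nx.DiGraph([("a", "b"), ("a", "c"), ("b", "c"), ("d", "c")])

    print(G.in_degree("c"))     # 3: number of followers of c
    print(nx.pagerank(G)["c"])  # an influence-style score for c
    print(nx.density(G))        # how densely connected the network is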

Syllabus:

References:

Pre-requisites:

There are no math or programming prerequisites for this lecture.

Short bio:

Dr. Haewoon Kwak is a research scientist at the Qatar Computing Research Institute (QCRI), Hamad Bin Khalifa University, with a background in computer science, and has studied online social networks for many years. He is a chair of the social networks and graph analysis track at WWW 2016. He is the lead author of “What is Twitter, a social network or news media?” (WWW 2010), which has been cited more than 3,500 times. As with his Twitter paper, he is always exploring new areas, as in his pioneering work on YouTube and other online social networks. At QCRI, he focuses on computational journalism, with special emphasis on the bias of news media. His recent work was presented at the Computational Journalism Symposium in 2014 and 2015.


Mirco Musolesi   
Reader in Data Science, Department of Geography, University College London, UK.
Mining Big (and Small) Mobile Data

Summary:

We constantly leave “digital traces” in our daily lives, in both the online and offline worlds. Posts in online social networks, mobile sensor data, and Open Data repositories are just a few examples of the variety of data sources available to practitioners and researchers. Often, this information is also associated with specific geographic locations. Examples are GPS trajectories collected using mobile and wearable devices or geolocated posts in online social networks. This data can be collected, analyzed and exploited for many practical applications with high commercial and societal impact. This course will provide an in-depth overview of the theoretical foundations, algorithms, systems and tools for mining social and geographic datasets collected by means of mobile devices or through the cellular infrastructure.
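
As one concrete example of such analysis (a sketch, not course material), a classic mobility metric computable from GPS traces is the radius of gyration, the typical distance a user roams from their centre of mass; the coordinates below are toy values on a flat plane, whereas real analyses use geodesic distances:

    from math import sqrt

    # Toy GPS fixes for one user, as (x, y) positions.
    points = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0), (3.0, 2.0)]

    # Centre of mass of the visited locations.
    cx = sum(x for x, _ in points) / len(points)
    cy = sum(y for _, y in points) / len(points)

    # Root mean squared distance from the centre of mass.
    r_gyration = sqrt(sum((x - cx) ** 2 + (y - cy) ** 2
                          for x, y in points) / len(points))
    print(r_gyration)  # ~1.58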

Syllabus:

References:

A full list of relevant articles, including key research papers, will be provided to the students during the module.

Pre-requisites:

Basic knowledge of mathematics/descriptive statistics and foundations of computer science.

Short bio:

Mirco Musolesi is a Reader in Data Science at the Department of Geography at University College London. He received a PhD in Computer Science from University College London and a Master’s in Electronic Engineering from the University of Bologna. He held research and teaching positions at Dartmouth College, Cambridge, St Andrews and Birmingham. He is a computer scientist with a strong interest in sensing, modelling, understanding and predicting human behaviour and dynamics in space and time, at different scales, using the “digital traces” we generate daily in our online and offline lives. He is interested in developing mathematical and computational models as well as implementing real-world systems based on them. This work has applications in a variety of domains, such as intelligent systems design, ubiquitous computing, networked systems, (cyber)security and privacy, and data analytics for “social good”. More details about his research profile can be found at: http://www.ucl.ac.uk/~ucfamus/


Prabhakar Raghavan   
Vice President, Google Apps
Introduction to Web Search Engines

Summary:

This course covers the basics of web search engines. The goal is to provide sufficient information for practically inclined students to build a simple search engine by the end of the second lecture.
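
In that spirit, the core of such a simple engine is an inverted index; the following toy sketch (an illustration, not course code) maps each term to the documents containing it and answers conjunctive queries by intersecting posting lists:

    from collections import defaultdict

    # Toy document collection.
    docs = {0: "web search engines", 1: "web mining", 2: "search and ranking"}

    # Inverted index: term -> set of document ids containing it.
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.split():
            index[term].add(doc_id)

    def search(query):
        """Return ids of documents containing every query term."""
        postings = [index[t] for t in query.split()]
        return set.intersection(*postings) if postings else set()

    print(search("web search"))  # {0}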

Syllabus:

References:

Introduction to Information Retrieval by Manning, Raghavan and Schütze.

Pre-requisites:

Basic algorithms and data structures; knowledge of Python or Java preferred.

Short bio:

Prabhakar Raghavan is Vice President of Google Apps, with responsibilities including Gmail, Google Docs, Drive, and Calendar. Raghavan's research interests include text and web mining and algorithm design. He is a former consulting professor of computer science at Stanford University and a former editor-in-chief of the Journal of the ACM. He has co-authored two textbooks, on randomized algorithms and on information retrieval. Raghavan received his PhD from the University of California, Berkeley, and is a member of the US National Academy of Engineering and a fellow of the ACM and of the IEEE; he has also been awarded a Laurea ad honorem by the University of Bologna. Prior to joining Google, he held positions as Head of Yahoo! Labs, as chief technology officer at Verity, and at IBM Research.


Uli Sattler   
Information Management Group, University of Manchester, UK
OWL, Underlying Logics, and What This Reasoning Is All about

Summary:

In this course, we will give an introduction to OWL and the underlying Description Logics, with a special emphasis on reasoning: what it is used for and how it is realised. We will discuss different kinds of reasoning according to their tasks, quality criteria, and reasoning techniques. The tasks include terminological reasoning as well as ontology-based data access (OBDA), module extraction, and entailment explanation. Quality criteria include soundness, completeness, etc., as well as different performance-related ones. Finally, we also plan to sketch out some different reasoning techniques to give a basic understanding of how reasoners perform these reasoning tasks.
A basic understanding of propositional logic would be good, and knowledge of first-order logic would be even better. Throughout the class, we will make use of running examples to give some high-level understanding.
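
As a toy illustration of the tableau technique mentioned above (a sketch for propositional logic only; Description Logic tableaux extend this with individuals and roles), formulas in negation normal form are expanded rule by rule, and a branch is satisfiable if it can be completed without a clash:

    # Formulas are nested tuples in negation normal form: ("and", a, b),
    # ("or", a, b), an atom "p", or a negated atom ("not", "p").
    def satisfiable(formulas, literals=frozenset()):
        if not formulas:
            # Saturated branch: a clash is an atom plus its negation.
            return not any(("not", a) in literals
                           for a in literals if isinstance(a, str))
        f, rest = formulas[0], formulas[1:]
        if isinstance(f, str) or f[0] == "not":
            return satisfiable(rest, literals | {f})
        if f[0] == "and":   # deterministic rule: both conjuncts hold
            return satisfiable([f[1], f[2]] + rest, literals)
        if f[0] == "or":    # branching rule: try each disjunct
            return (satisfiable([f[1]] + rest, literals) or
                    satisfiable([f[2]] + rest, literals))

    print(satisfiable([("and", ("or", "p", "q"), ("not", "p"))]))  # True
    print(satisfiable([("and", "p", ("not", "p"))]))               # False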

Syllabus:

References:

Pre-requisites:

No strict pre-requisites, but some knowledge of propositional/Boolean logic would be good.

Short bio:

Uli has been working in Description Logics and related subjects for over 20 years. She has contributed to the design of the family of Description Logics underlying OWL, including SHIQ and SROIQ, to the development of tableau algorithms for these logics, and to the development of OWL itself. Her current interests include the usage of OWL in ontology-based information systems for various applications, and the requirements of these applications, e.g., for modularity, entailment explanation, and ontology learning. She also likes mountains, both with and without snow.


Barry Smith   
National Center for Ontological Research (NCOR).
Towards Ontological Foundations for Web Science

Summary:

The web has resulted in an enormous proliferation of new sorts of human interaction, which continue to expand in scope and complexity and which are spawning new sorts of skills, new institutions, new values, new sorts of money, new methods of control, and new opportunities for massive human agency in areas such as science, journalism, education, art, crime, commerce, law, medicine, terrorism, defense and intelligence. While ontologies of some sophistication have been developed, above all in fields such as biology and medicine, to cope with high-complexity interactions of many different sorts, no comparably sophisticated ontology resources exist that are able to deal with human interactions and with the artifacts to which these interactions give rise at ever larger scales. We will attempt to fill this gap, starting with a survey of the simple building blocks of a theory of human interaction and moving from there to the treatment of a series of more complex examples.

Syllabus:

  1. Building blocks of human interaction:
    1. speech acts, social acts, document acts, internet acts
      trust, reputation, punishment, authority, expertise, prestige, obligation, claim, value, credit
  2. The ontology of organizations
  3. Massive shared agency: Control of human interactions at internet scale
  4. Examples: Bitcoin, Uber, Tripadvisor, Wikileaks, …
  5. What is Web Science?

References:

Robert Arp, Barry Smith and Andrew Spear, Building Ontologies with Basic Formal Ontology, Cambridge, MA: MIT Press, August 2015, xxiv + 220pp.

Barry Smith, “Towards a Science of Emerging Media”, Philosophy of Emerging Media: Understanding, Appreciation and Application, edited by J. E. Katz and J. Floyd, Oxford: Oxford University Press, December 2015, 29-48.

Barry Smith, "Document Acts”, in A. Konzelmann-Ziv and H. Bernhard Schmid (eds.), Institutions, Emotions, and Group Agents. Contributions to Social Ontology, Dordrecht: Springer, 2014, 19-31.

Pre-requisites:

None

Short bio:

Barry Smith is a prominent contributor to both theoretical and applied research in ontology. He is Director of the National Center for Ontological Research and Professor in the Departments of Philosophy, Biomedical Informatics, Computer Science, and Neurology in the University at Buffalo.
Smith is the author of some 500 publications on ontology and related topics. His research has been funded by the National Institutes of Health, the US, Swiss and Austrian National Science Foundations, the Volkswagen Foundation, the Humboldt Foundation, the European Union, and the US Department of Defense. Since 2000 he has served as consultant to Hernando de Soto, Director of the Institute for Liberty and Democracy in Peru, on projects, most recently involving the use of blockchain technology, to advance property and business rights among the poor in developing countries by providing secure title.
Smith's pioneering work on the science of ontology led to the establishment of Basic Formal Ontology (BFO) as the most commonly adopted upper-level ontology development framework. The methodology underlying BFO is now being applied in a range of different domains, including biomedicine, military intelligence, engineering, and sustainable development.


Raphael Volz   
University of Pforzheim (Germany)
Improving Prediction Models with Open Data

Summary:

This course will introduce students to prediction models and how these models can be improved by incorporating open data. Students learn how to obtain prediction models from data using popular supervised machine learning techniques (linear/logistic regression, decision trees/random forests, deep learning/feed-forward neural networks) and explore scenarios such as parking demand forecasting, fuel price prediction, and flight delay forecasting using R and the H2O.ai machine learning platform.
We then cover important open data sets (OpenStreetMap, Wikidata, weather data) and show how these data sets can be used to improve our prediction models.
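
As a minimal sketch of the core idea (in Python with scikit-learn rather than the course's R/H2O stack, and with entirely synthetic numbers), we add an open-data feature, a stand-in "temperature" column, and compare forecast error with and without it:

    import numpy as np
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.metrics import mean_absolute_error

    # Synthetic parking-demand data: demand depends on hour of day and
    # on temperature (the stand-in for open weather data).
    rng = np.random.default_rng(0)
    hour = rng.integers(0, 24, 500)
    temp = rng.normal(15, 8, 500)
    demand = (50 + 10 * np.sin(hour / 24 * 2 * np.pi)
              - 0.8 * temp + rng.normal(0, 3, 500))

    X_base = hour.reshape(-1, 1)            # without open data
    X_open = np.column_stack([hour, temp])  # with open data
    train, test = slice(0, 400), slice(400, 500)

    for name, X in [("base", X_base), ("with weather", X_open)]:
        model = RandomForestRegressor(random_state=0)
        model.fit(X[train], demand[train])
        err = mean_absolute_error(demand[test], model.predict(X[test]))
        print(name, round(err, 2))  # error should drop with the extra feature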

Syllabus:

  1. Introducing predictive analytics
    • Objectives
    • Methodology
    • Prediction Models
  2. Supervised learning with machines and data
    • Linear and logistic regression
    • Decision Trees / Random Forest
    • Deep Learning / Neural Networks
  3. Creating Prediction Models with Software [HANDS ON R and H2O]
    • Parking Demand Forecasting
    • Fuel Price Predictions
    • Flight Delay Forecasting
  4. Incorporating Open Data Sets
    • OpenStreetMap
    • Wikidata
    • Weather Data
  5. Summary and outlook

References:

To be provided in course slides and online course material

Pre-requisites:

Basic knowledge of statistics and data processing

Short bio:

Raphael Volz is an educator, inventor and entrepreneur. He teaches courses in data science, machine learning, and cloud computing at the University of Pforzheim (Germany), where he is a full professor of applied informatics. He also serves as chairman of the board of Volz Innovation GmbH, a consulting firm focused on technology-based product innovation. In 2009, he founded nogago GmbH, a software company devoted to outdoor navigation using open data and open source apps. He obtained his doctorate at the Karlsruhe Institute of Technology (KIT) in 2004, researching the intersection of Description Logic and Logic Programming, and previously investigated how ontologies can be acquired from text using machine learning techniques. He loves to code and has co-created several open source projects, including the OWL API, KAON and SocialGrails.