4th International Winter School on Big Data

Timişoara, Romania, January 22-26, 2018

Course Description









Courses


Paul Bliese   
Associate Professor of Business Administration in the Management Department of the Darla Moore School of Business at the University of South Carolina.
Using R for Mixed-effects (Multilevel) Models [introductory/intermediate]

Summary:

Mixed-effects or multilevel models are commonly used when data have some form of nested structure. For instance, individuals may be nested within workgroups, or repeated measures may be nested within individuals. Nested structures in data are often accompanied by some form of non-independence. For example, in work settings, individuals in the same workgroup typically display some degree of similarity with respect to performance or they provide similar responses to questions about aspects of the work environment. Likewise, in repeated measures data, individuals usually display a high degree of similarity in responses over time. Non-independence may be considered either a nuisance variable or something to be substantively modeled, but the prevalence of nested data requires that analysts have a variety of tools to deal with it. This course provides an introduction to (1) the theoretical foundation, and (2) resources necessary to conduct a wide range of multilevel analyses. All practical exercises are conducted in R. Participants are encouraged to bring datasets to the course and apply the principles to their specific areas of research.
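
As a rough illustration of the non-independence described above, the sketch below estimates the intraclass correlation ICC(1) from a one-way ANOVA decomposition. It is written in Python only for illustration (the course exercises use R), it assumes balanced groups, and the function name and the toy workgroup ratings are hypothetical, not taken from the course material.

```python
from statistics import mean


def icc1(groups):
    """Estimate ICC(1) from a one-way ANOVA; assumes all groups have the same size k."""
    k = len(groups[0])
    if any(len(g) != k for g in groups):
        raise ValueError("this sketch assumes balanced groups")
    n_groups = len(groups)
    grand = mean(x for g in groups for x in g)
    # Between-group mean square: variation of group means around the grand mean.
    msb = k * sum((mean(g) - grand) ** 2 for g in groups) / (n_groups - 1)
    # Within-group mean square: variation of individuals around their own group mean.
    msw = sum((x - mean(g)) ** 2 for g in groups for x in g) / (n_groups * (k - 1))
    return (msb - msw) / (msb + (k - 1) * msw)


# Hypothetical performance ratings for three workgroups of four people each.
workgroups = [
    [4.0, 5.0, 4.0, 5.0],
    [2.0, 3.0, 2.0, 3.0],
    [7.0, 8.0, 7.0, 8.0],
]
print(round(icc1(workgroups), 3))
```

A value near 1 means that most of the variance lies between groups, so members of the same workgroup cannot be treated as independent observations; a value near 0 suggests that group membership explains little.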

Syllabus:

Session 1:
1. Introduction and overview of Multilevel Models
2. Introduction to R and the nlme and multilevel packages
Session 2:
3. Nested Data and Mixed-Effects Models in nlme
4. R Code for Models and Introduction to Functions Commonly used in Data Manipulation
Session 3:
5. Repeated Measures data and Growth Models in nlme
6. R Code for Models and Introduction to Functions Commonly used in Data Manipulation

Pre-requisites:

Basic understanding of regression. An installed version of R (https://cran.r-project.org/) on a laptop for completing exercises. Users are also encouraged to install RStudio (https://www.rstudio.com/).

References:

Bliese, P. D. (2016). Multilevel Modeling in R (v. 2.6). https://cran.r-project.org/doc/contrib/Bliese_Multilevel.pdf

Short Bio

Paul D. Bliese, Ph.D. joined the Management Department at the Darla Moore School of Business, University of South Carolina in 2014. Prior to joining South Carolina, he spent 22 years as a research psychologist at the Walter Reed Army Institute of Research, where he conducted research on stress, adaptation, leadership, well-being, and performance. Professor Bliese has long-term interests in understanding how analytics contribute to theory development and in applying analytics to complex organizational problems. He built and maintains the multilevel package for R. Professor Bliese has served on numerous editorial boards, and has been an associate editor at the Journal of Applied Psychology since 2010. In July of 2017 he took over as editor-in-chief of Organizational Research Methods.








Geoffrey C. Fox   
Chair, Intelligent Systems Engineering, School of Informatics and Computing. Distinguished Professor of Computing, Engineering and Physics. Director of the Digital Science Center, Indiana University – Bloomington
Integration of HPC, Big Data Analytics and Software Ecosystem [Intermediate]

Summary:

Two major trends in computing systems are the growth in high performance computing (HPC) with an international exascale initiative, and the big data phenomenon with an accompanying cloud infrastructure of well publicized dramatic and increasing size and sophistication. This tutorial weaves these trends together using some key building blocks. The first is HPC-ABDS, the High Performance Computing (HPC) enhanced Apache Big Data Stack (ABDS). Here we aim at using the major open source Big Data software environment but develop the principles allowing use of HPC software and hardware to achieve good performance. We give several examples of software (for example Hadoop and Heron) and algorithms implemented in this software. The second building block is the SPIDAL library (Scalable Parallel Interoperable Data Analytics Library) of scalable machine learning and data analysis software. We give examples including clustering, topic modelling and dimension reduction and their visualization with a framework called Harp. The third building block is an analysis of simulation and big data use cases in terms of 64 separate features (varying from data volume to “suitable for MapReduce” to kernel algorithm used). This allows an understanding of what type of hardware and software is needed for what type of exhibited features; it allows one to either unify or distinguish applications across the simulation and Big Data regimes. We show that using a broad range of applications requires a variety of capabilities that seem best packaged as a reconfigurable toolkit, Twister2.

Syllabus:

Session 1: HPC-ABDS and the Ogres
-Rationale for using ABDS (Apache Big Data Stack)
-Architecture of ABDS
-Reasons to enhance ABDS with HPC
-Motivating Applications and Big Data Ogres
-Examples including Harp (for Hadoop), HPC-Heron; rationale for Twister2

Session 2: Twister2 and Harp
-Design of Twister2 -- a toolkit of the parts in Heron, Spark, Flink, Hadoop, MPI, Harp
-Design of Harp -- a High Performance Machine Learning Framework
-Using Harp and Twister2

Session 3: SPIDAL Scalable Parallel Interoperable Data Analytics Library
-Some important issues in getting high performance in parallel applications
-A few short discussions of individual machine learning cases and their use in applications
-These are intermixed with performance results including accelerators and
-'SPIDAL Java' -- principles to make Java run fast on parallel applications

Pre-requisites:

Some familiarity with ABDS software such as Hadoop, Spark, Flink, Storm, Heron and HPC technologies such as MPI would be helpful. Some familiarity with parallel computing (algorithms and software) and with data analytics would also be helpful.

References:

Geoffrey Fox, David Crandall, Judy Qiu, Gregor Von Laszewski, Shantenu Jha, John Paden, Oliver Beckstein, Tom Cheatham, Madhav Marathe, Fusheng Wang, 'Tutorial Program', BigDat 2017 MIDAS and SPIDAL Tutorial Bari Italy February 13-14 2017
http://dsc.soic.indiana.edu/publications/SPIDAL-DIBBSreport_July2016.pdf 21 month report of SPIDAL (Scalable Parallel Interoperable Data Analytics Library) project.
Supun Kamburugamuve, Kannan Govindarajan, Pulasthi Wickramasinghe, Vibhatha Abeykoon, Geoffrey Fox, 'Twister2: Design of a Big Data Toolkit'
http://hpc-abds.org/kaleidoscope/ HPC-ABDS and Big Data Ogres Analysis
Geoffrey C. Fox, Vatche Ishakian, Vinod Muthusamy, Aleksander Slominski, 'Status of Serverless Computing and Function-as-a-Service (FaaS) in Industry and Research', Report from workshop and panel at the First International Workshop on Serverless Computing (WoSC) Atlanta, June 5 2017
B. Peng, B. Zhang, L. Chen, M. Avram, R. Henschel, C. Stewart, S. Zhu, E. Mccallum, L. Smith, T. Zahniser, J. Omer, J. Qiu. 'HarpLDA+: Optimizing Latent Dirichlet Allocation for Parallel Efficiency' Technical Report (August 2017)
Supun Kamburugamuve, Pulasthi Wickramasinghe, Saliya Ekanayake, Geoffrey C. Fox, 'Anatomy of machine learning algorithm implementations in MPI, Spark, and Flink', International Journal of High Performance Computing Applications to be published.
Also see Projects (with updates) at https://www.researchgate.net/profile/Geoffrey_Fox and presentations at https://www.dsc.soic.indiana.edu/presentations

Short Bio

Geoffrey Fox received a Ph.D. in Theoretical Physics from Cambridge University where he was Senior Wrangler. He is now a distinguished professor of Engineering, Computing, and Physics at Indiana University where he is director of the Digital Science Center, and both Department Chair and Associate Dean for Intelligent Systems Engineering at the School of Informatics, Computing, and Engineering. He previously held positions at Caltech, Syracuse University, and Florida State University after being a postdoc at the Institute for Advanced Study at Princeton, Lawrence Berkeley Laboratory, and Peterhouse College Cambridge. He has supervised the Ph.D. of 70 students and published around 1300 papers (over 450 with at least 10 citations) in physics and computing with an h-index of 75 and over 31,500 citations. He is a Fellow of APS (Physics) and ACM (Computing) and works on the interdisciplinary interface between computing and applications. He currently researches the application of computer science from infrastructure to analytics in Biology, Pathology, Sensor Clouds, Earthquake and Ice-sheet Science, Image processing, Deep Learning, Network Science, Financial Systems and Particle Physics. The infrastructure work is built around Software Defined Systems on Clouds and Clusters. The analytics focuses on scalable parallelism. He is an expert on streaming data and robot-cloud interactions. He is involved in several projects to enhance the capabilities of Minority Serving Institutions. He has experience in online education and its use in MOOCs for areas like Data and Computational Science.








Minos Garofalakis   
Professor, School of Electrical and Computer Engineering, Technical University of Crete, Chania, Greece
Data Streaming Analytics [intermediate/advanced]

Summary:

Effective Big Data analytics need to rely on algorithms for querying and analyzing massive, continuous data streams (that is, data that is seen only once and in a fixed order) with limited memory and CPU-time resources. Such streams arise naturally in emerging large-scale event monitoring applications; for instance, network-operations monitoring in large ISPs, where usage information from numerous network devices needs to be continuously collected and analyzed for interesting trends and real-time reaction to different scenarios (e.g., hotspots or DDoS attacks). In addition to memory- and time-efficiency concerns, the inherently distributed nature of such applications also raises important communication-efficiency issues, making it critical to carefully optimize the use of the underlying communication infrastructure. This course will provide an overview of some key algorithmic tools for supporting effective, real-time analytics over streaming data. Our primary focus will be on small-space sketch synopses for approximating continuous data streams, and their applicability in both centralized and distributed settings.
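
To give a flavour of what a small-space sketch synopsis looks like, here is a minimal Count-Min sketch in Python. It is only an illustrative sketch under stated assumptions, not course material: the class name, the default width and depth, the example IP addresses, and the use of a salted BLAKE2b hash in place of pairwise-independent hash functions are all choices made for this example.

```python
import hashlib


class CountMinSketch:
    """Approximate frequency counts over a stream using a fixed-size table."""

    def __init__(self, width=272, depth=5):
        # In the standard analysis, width = ceil(e / eps) and depth = ceil(ln(1 / delta))
        # bound the overestimate by eps * N with probability at least 1 - delta.
        # The defaults correspond to eps = 0.01 and delta = 0.01.
        self.width = width
        self.depth = depth
        self.table = [[0] * width for _ in range(depth)]

    def _index(self, item, row):
        # Assumption: a per-row salted hash stands in for a pairwise-independent family.
        digest = hashlib.blake2b(
            item.encode("utf-8"), digest_size=8, salt=row.to_bytes(8, "little")
        ).digest()
        return int.from_bytes(digest, "little") % self.width

    def add(self, item, count=1):
        # Each arrival touches one counter per row; the stream itself is never stored.
        for row in range(self.depth):
            self.table[row][self._index(item, row)] += count

    def estimate(self, item):
        # Collisions can only inflate a counter, so the row minimum is the best estimate.
        return min(self.table[row][self._index(item, row)] for row in range(self.depth))


# Hypothetical stream of source addresses seen by a network monitor.
stream = ["10.0.0.1"] * 50 + ["10.0.0.2"] * 5 + ["10.0.0.3"]
cms = CountMinSketch()
for address in stream:
    cms.add(address)
print(cms.estimate("10.0.0.1"))
```

With non-negative updates the estimate never falls below the true count, and memory use depends only on the chosen width and depth, not on the length of the stream.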

Syllabus:

1. Introduction and Motivation
2. Data Streaming Models and Mathematical Tools
3. Basic Algorithmic Tools for Data Streams
   * Reservoir Sampling
   * Bag Synopses: AMS and CountMin Sketches
   * Set Synopses: FM Sketches and Distinct Sampling
4. The Sliding Window Model and Exponential Histograms
5. Distributed Data Streaming
   * Basic Models and Techniques
   * The Geometric Method and Convex Safe Zones
6. Conclusions and Looking Forward
7. (Time-permitting) Hands-on Experience with Streaming Tools
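
As an example of the basic tools listed in item 3, reservoir sampling keeps a uniform random sample of fixed size from a stream of unknown length in a single pass. The Python sketch below follows the classic replace-with-probability-k/(i+1) scheme; the function name and the optional `rng` parameter are hypothetical conveniences for this illustration.

```python
import random


def reservoir_sample(stream, k, rng=None):
    """Return a uniform random sample of up to k items from a one-pass stream."""
    rng = rng or random.Random()
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Pick a position in [0, i]; the new item survives with probability k / (i + 1).
            j = rng.randint(0, i)
            if j < k:
                reservoir[j] = item
    return reservoir


# Usage: sample 10 items from a stream of 1000 without storing the stream.
sample = reservoir_sample(range(1000), 10, random.Random(0))
print(sample)
```

Only the k sampled items and a counter are kept in memory, which is what makes the method suitable for streams that are seen once and in a fixed order.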

Pre-requisites:

Database management systems, design and analysis of algorithms, randomized algorithms

References:

Surveys/Monographs:
1. Graham Cormode, Minos Garofalakis, Peter J. Haas, and Chris Jermaine. “Synopses for Massive Data: Samples, Histograms, Wavelets, Sketches”, Foundations and Trends in Databases 4(1-3): 1-294 (2012)
2. Minos Garofalakis, Johannes Gehrke, and Rajeev Rastogi (Eds.). “Data-Stream Management — Processing High-Speed Data Streams”, Springer-Verlag, New York (Data-Centric Systems and Applications Series), July 2016 (ISBN 978-3-540-28607-3).
Papers:
1. Noga Alon, Yossi Matias, Mario Szegedy: The Space Complexity of Approximating the Frequency Moments. ACM STOC 1996.
2. Noga Alon, Phillip B. Gibbons, Yossi Matias, Mario Szegedy: Tracking Join and Self-Join Sizes in Limited Storage. ACM PODS 1999.
3. Graham Cormode, S. Muthukrishnan: An Improved Data Stream Summary: The Count-Min Sketch and Its Applications. LATIN 2004.
4. Phillip B. Gibbons: Distinct Sampling for Highly-Accurate Answers to Distinct Values Queries and Event Reports. VLDB 2001.
5. Mayur Datar, Aristides Gionis, Piotr Indyk, Rajeev Motwani: Maintaining Stream Statistics over Sliding Windows. SIAM J. on Computing 31(6), 2002.
6. Graham Cormode, Minos N. Garofalakis: Approximate continuous querying over distributed streams. ACM Trans. Database Syst. 33(2), 2008.
7. Izchak Sharfman, Assaf Schuster, Daniel Keren: A geometric approach to monitoring threshold functions over distributed data streams. ACM SIGMOD Conference 2006.
8. Minos N. Garofalakis, Daniel Keren, Vasilis Samoladas: Sketch-based Geometric Monitoring of Distributed Stream Queries. PVLDB 6(10), 2013.
9. Arnon Lazerson, Izchak Sharfman, Daniel Keren, Assaf Schuster, Minos N. Garofalakis, Vasilis Samoladas: Monitoring Distributed Streams using Convex Decompositions. PVLDB 8(5), 2015.

Short Bio

Minos Garofalakis is the Director of the Institute for the Management of Information Systems (IMIS) at the Athena Research and Innovation Centre in Athens, Greece, and a Professor of Computer Science at the School of Electrical and Computer Engineering of the Technical University of Crete (TUC), where he also directs the Software Technology and Network Applications Laboratory (SoftNet). He received his PhD in Computer Science from the University of Wisconsin-Madison in 1998, and has held positions as a Member of Technical Staff at Bell Labs, Lucent Technologies in Murray Hill, NJ (1998-2005), as a Senior Researcher at Intel Research Berkeley in Berkeley, CA (2005-2007), and as a Principal Research Scientist at Yahoo! Research in Santa Clara, CA (2007-2008). In parallel, he also held an Adjunct Associate Professor position at the EECS Department of the University of California, Berkeley (2006-2008). Prof. Garofalakis's research interests are in the broad areas of Big Data analytics and large-scale machine learning, including database systems, centralized/distributed data streams, data synopses and approximate query processing, uncertain databases, and data mining and knowledge discovery. He has published over 150 scientific papers in top-tier international conferences and journals in these areas. His work has resulted in 36 US Patent filings (29 patents issued) for companies such as Lucent, Yahoo!, and AT&T. Google Scholar gives over 12,000 citations to Prof. Garofalakis's work, and an h-index value of 60. He is an IEEE Fellow (Class of 2017, 'for contributions to data streaming analytics'), an ACM Distinguished Scientist (2011), and a recipient of the TUC 'Excellence in Research' Award (2015), the Bell Labs President's Gold Award (2004), and the Bell Labs Teamwork Award (2003).








David Gerbing   
Professor of Quantitative Methods. Portland State University
Data Visualization with R [introductory]

Summary:

This seminar introduces the R language via data visualization, aka computer graphics, in the context of a discussion of best practices and considerations for the analysis of big data. Code to generate the graphs is presented in terms of R base graphics, Hadley Wickham's ggplot2 package, and the author's lessR package. The content of the seminar is summarized in R Markdown files that include commentary and implementation of all the code presented in the seminar, available to all participants. These explanatory examples serve as templates for applications to new data sets.

Syllabus:

Day 1
-----
Introduction to R
   * R functions and syntax
   * R variable types
   * Read data into R

Specialized Graphic Functions
   * Functions from the lessR package
   * The ggplot function from the ggplot2 package
   * Base R graphics

Themes

Day 2
-----
Bar Charts for Distributions of Categorical Variables
   * R factor variables
   * Counts of one variable
   * Joint frequencies of two variables
   * Statistics of a second variable plotted against one variable

Graphs for Distributions of a Continuous Variable
   * Histograms and binning
   * Densities
   * Boxplot
   * Scatterplot, 1-dimensional
   * Introduction to the integrated Violin/Box/Scatterplot, the VBS plot

Scatterplots, 2-dimensional
   * With two or more continuous variables
   * A categorical variable with a continuous variable
   * Bubble plots with categorical variables
   * Two variable plot with a third variable, categorical or continuous

Day 3
-----
Scatterplots, 2-dimensional (continued)
   * Visualization of relationships for big data sets

Time Series Plots
   * One-variable plot
   * Stacked time-series plot
   * Area plots
   * Forecasts

Pre-requisites:

Basic understanding of data analysis

References:

Gerbing, D. W. (2013). R Data Analysis without Programming, NY: Routledge.
Wickham, H. (2009). ggplot2: Elegant Graphics for Data Analysis. Springer-Verlag New York.

Short Bio

David Gerbing, Ph.D., has been Professor of Quantitative Methods in the School of Business Administration at Portland State University since 1987. He received his Ph.D. in quantitative psychology from Michigan State University in 1979. From 1979 until 1987 he was an Assistant and then Associate Professor of Psychology and Statistics at Baylor University. He has authored R Data Analysis without Programming, which describes his lessR package, and many articles on statistical techniques and their application in a variety of journals that span several academic disciplines.








Maurizio Lenzerini   
Full professor in Computer Science. Sapienza Università di Roma.
Semantic technologies for open data publishing [intermediate/advanced]

Summary:

Semantic technologies may promote new ways of managing data within an organization. In particular, the paradigm of ontology-based data management provides techniques for accessing, using, and maintaining data by means of an ontology, i.e., a conceptual representation of the domain of interest in the underlying information system. This paradigm aims at addressing one important challenge of modern information systems, namely managing the autonomous, distributed, and heterogeneous data sources of an organization, and devising tools for deriving useful information and knowledge from them. On the other hand, many of today's organizations face, among others, the problem of publishing Open Data. Despite the current interest in this subject, a formal and comprehensive methodology that supports an organization in deciding which data to publish, and in carrying out precise procedures for publishing high-quality data, is still missing. In the course, we first provide an introduction to ontology-based data management, then we discuss the main techniques for using an ontology to access the data layer of an information system, and finally we illustrate the basic elements of a methodology for ontology-based Open Data publishing.

Syllabus:

Introduction to ontology-based data management (OBDM); languages for OBDM; query answering in OBDM; meta-modeling and higher-order ontology languages; the problem of open data publishing; ontology-based open data publishing.

Pre-requisites:

Basic notions of databases, logic, computational complexity.

References:

Diego Calvanese, Giuseppe De Giacomo, Domenico Lembo, Maurizio Lenzerini, Riccardo Rosati: Tractable Reasoning and Efficient Query Answering in Description Logics: The DL-Lite Family. J. Autom. Reasoning 39(3): 385-429 (2007)
Antonella Poggi, Domenico Lembo, Diego Calvanese, Giuseppe De Giacomo, Maurizio Lenzerini, Riccardo Rosati: Linking Data to Ontologies. J. Data Semantics 10: 133-173 (2008)
Diego Calvanese, Giuseppe De Giacomo, Domenico Lembo, Maurizio Lenzerini, Antonella Poggi, Mariano Rodriguez-Muro, Riccardo Rosati: Ontologies and Databases: The DL-Lite Approach. Reasoning Web 2009: 255-356
Roman Kontchakov, Mariano Rodriguez-Muro, Michael Zakharyaschev: Ontology-Based Data Access with Databases: A Short Course. Reasoning Web 2013: 194-229
Domenico Lembo, Maurizio Lenzerini, Riccardo Rosati, Marco Ruzzi, Domenico Fabio Savo: Inconsistency-tolerant query answering in ontology-based data access. J. Web Sem. 33: 3-29 (2015)

Short Bio

Maurizio Lenzerini (http://www.dis.uniroma1.it/~lenzerini) is a Professor of Data Management at the Dipartimento di Ingegneria Informatica Automatica e Gestionale Antonio Ruberti of Sapienza Università di Roma, where he leads a research group working on Database Theory, Data Management, Knowledge Representation and Automated Reasoning, and Ontology-based Data Management and Integration. He is the author of more than 300 publications on the above topics, which have received about 24,000 citations. According to Google Scholar, his h-index is currently 75. He has been an invited keynote speaker at many international conferences. He is the recipient of two IBM Faculty Awards, a Fellow of EurAI (formerly European Coordinating Committee for Artificial Intelligence, ECCAI) since 2008, a Fellow of the ACM (Association for Computing Machinery) since 2009, a Fellow of the AAAI (Association for the Advancement of Artificial Intelligence) since 2017, and a member of the Academia Europaea - The European Academy since 2011.








Bing Liu   
Distinguished Professor, Department of Computer Science, University of Illinois at Chicago (UIC)
Lifelong Learning and its Applications in NLP [intermediate/advanced]

Summary:

Lifelong Learning is an advanced machine learning (ML) paradigm that learns continuously, accumulates the knowledge learned in the past, and uses it to help future learning. In the process, the learner becomes more and more knowledgeable and effective at learning. This learning ability is one of the hallmarks of human intelligence. However, the current dominant ML paradigm learns in isolation: given a training dataset, it runs an ML algorithm on the dataset to produce a model. It does not retain the learned knowledge and use it in future learning. Although this isolated learning paradigm has been very successful, it requires a large amount of training data and is only suitable for well-defined and narrow tasks. In comparison, we humans can learn effectively from a few examples because we have accumulated so much knowledge in the past, which enables us to learn with little data or effort. Lifelong learning aims to achieve this capability. Applications such as chatbots and physical robots that interact with real-life environments all call for such learning capabilities. Without this ability, a system will probably never be truly intelligent. In this lecture, I will introduce lifelong learning and discuss some of its applications in natural language processing (NLP).

Syllabus:

1. Introduction and motivations
2. Definition of lifelong learning
3. Related learning paradigms
4. Lifelong supervised learning
5. Open world learning
6. Learning during model application
7. Lifelong topic modeling
8. Lifelong learning in information extraction
9. Lifelong learning in belief propagation
10. Summary

Pre-requisites:

Basic knowledge of machine learning

References:

1. Zhiyuan Chen and Bing Liu. Lifelong Machine Learning. Morgan & Claypool Publishers, Nov 2016
2. Zhiyuan Chen and Bing Liu. Mining Topics in Documents: Standing on the Shoulders of Big Data. KDD-2014.
3. Zhiyuan Chen and Bing Liu. Topic Modeling using Topics from Many Domains, Lifelong Learning and Big Data. ICML-2014.
4. Zhiyuan Chen, Nianzu Ma and Bing Liu. Lifelong Learning for Sentiment Classification. ACL-2015, (short paper).
5. Geli Fei, Shuai Wang, and Bing Liu. 2016. Learning Cumulatively to Become More Knowledgeable. KDD-2016.
6. T. Mitchell, W. Cohen, E. Hruschka, P. Talukdar, J. Betteridge, A. Carlson, B. Dalvi, M. Gardner, B. Kisiel, J. Krishnamurthy, N. Lao, K. Mazaitis, T. Mohamed, N. Nakashole, E. Platanios, A. Ritter, M. Samadi, B. Settles, R. Wang, D. Wijaya, A. Gupta, X. Chen, A. Saparov, M. Greaves, and J. Welling. Never-ending learning. AAAI, 2015.
7. Paul Ruvolo and Eric Eaton. ELLA: An efficient lifelong learning algorithm. ICML-2013.
8. Lei Shu, Hu Xu, and Bing Liu. Lifelong Learning CRF for Supervised Aspect Extraction. ACL-2017, (short paper).
9. Lei Shu, Bing Liu, Hu Xu, and Annice Kim. Lifelong-RL: Lifelong Relaxation Labeling for Separating Entities and Aspects in Opinion Targets. EMNLP 2016.
10. Daniel L. Silver, Qiang Yang, and Lianghao Li. Lifelong machine learning systems: Beyond learning algorithms. In AAAI Spring Symposium: Lifelong Machine Learning, pages 49–55, 2013.
11. Sebastian Thrun. Is Learning the N-th Thing Any Easier than Learning the First? In Advances in neural information processing systems, pp. 640–646. Morgan Kaufmann Publishers, 1996.
12. Sebastian Thrun and Tom Mitchell. Lifelong robot learning. Springer, 1995.

Short Bio

Bing Liu is a distinguished professor of Computer Science at the University of Illinois at Chicago. He received his Ph.D. in Artificial Intelligence from the University of Edinburgh. His research interests include lifelong learning, sentiment analysis, data mining, machine learning, and natural language processing. He has published extensively in top conferences and journals. Two of his papers have received 10-year Test-of-Time awards from KDD. He also authored four books: one on lifelong learning, two on sentiment analysis, and one on Web mining. Some of his work has been widely reported in the press, including a front-page article in the New York Times. On professional services, he served as the Chair of ACM SIGKDD (ACM Special Interest Group on Knowledge Discovery and Data Mining) from 2013-2017. He has also served as program chair of many leading data mining conferences, including KDD, ICDM, CIKM, WSDM, SDM, and PAKDD, as associate editor of leading journals such as TKDE, TWEB, and DMKD, and as area chair or senior PC member of numerous natural language processing, AI, Web, and data mining conferences. He is a Fellow of ACM, AAAI and IEEE.








B.S. Manjunath   
Distinguished Professor. Electrical and Computer Engineering. University of California, Santa Barbara
Unstructured (Big) Data [introductory]

Summary:

Multimodal, unstructured data is ubiquitous: from consumer devices such as smart phones to scientific imaging, we encounter this data constantly, everywhere. This data is voluminous, accounting for a significant part of the digital data (one could speculate this to be >90%) generated around the world, daily. This data is complex and unstructured. In many applications, this data varies over time, and these time scales differ depending on the application. However, much of this multi-scale, multi-modal, unstructured and dynamic data remains under-exploited and un-interrogated. This lecture explores the challenges associated with such data analytics and how this differs from the more traditional big-data problems. Some interesting case studies in life sciences and medicine will be presented, with a focus on imaging data. The lecture will conclude with an overview of the BisQue software platform that is being developed at UCSB towards addressing the challenges associated with managing such data and creating reproducible workflows to analyze imaging data.

Syllabus:

Unstructured big-data challenges and examples.
Feature extraction in images/video: traditional methods to recent advances in deep learning methods.
Towards reproducible image informatics: BisQue open source project.

Pre-requisites:

Undergraduate-level exposure to linear algebra and calculus. A course in image processing/computer vision will help but is not required.

References:

Recent publications (conference/journal articles) on the above topics (to be added). For BisQue, see http://bioimage.ucsb.edu and http://cyverse.org

Short Bio

B.S. Manjunath is a Distinguished Professor of Electrical and Computer Engineering at the University of California, Santa Barbara. He received his Ph.D. in Electrical Engineering from the University of Southern California and the M.E. in Systems Science and Automation from the Indian Institute of Science. His research interests are in image informatics and in recent years he has focused on application to life and health sciences. He has published over 300 peer-reviewed articles, is an inventor on 24 patents, and co-edited the book on MPEG-7.








Fionn Murtagh   
Professor of Data Science, University of Huddersfield.
The New Science of Big Data Analytics, Based on the Geometry and the Topology of Complex, Hierarchic Systems [introductory/advanced]

Summary:

These foundations of Data Science are solidly based on mathematics and computational science. The hierarchical nature of complex reality is part and parcel of this new, mathematically well-founded way of observing and interacting with (physical, social and all) realities.
These lectures include pattern recognition and knowledge discovery, computer learning and statistics. They address how geometry and topology can uncover and empower the semantics of data. Key themes include: text mining; linear computational time hierarchical clustering, search and retrieval; the Correspondence Analysis platform that performs latent semantic factor space mapping, and accompanying hierarchical clustering.
Various application domains are covered in the case studies. In text mining, these include literary text and social media (Twitter); in clustering, they include astronomy, chemistry, and psychoanalysis. The final discussion concerns the increasingly important domains of smart environments, Internet of Things, and health analytics, and the further general scope of Big Data.

Syllabus:

Topics
- General Introduction. The Visualization and the Verbalization of Data.
- Analytics through the Geometry and Topology of Complex Systems. Metric, Ultrametric Frameworks. Hierarchy and Symmetry.
- Search and Discovery, Clustering and Regression: Pattern Recognition in Very High Dimensions.
- Text and Related Analytics. Between Lives of Narratives and Narratives of Lives.
Applications include:
- Social science, following Pierre Bourdieu.
- A few issues of cosmology.
- Literary work, between style and semantics.
- Large data analytics in astronomy, chemistry, finance.
- Social media analytics: Letting the data speak.
- Computational psychoanalysis.

The case studies are in R. Because R is discussed in general terms, they can also benefit users of other software environments. The presentation encompasses general background and introduction, as well as potentially innovative developments.

Session 1: Semantic mapping, both metric and ultrametric.
Session 2: Application of textual narrative.
Session 3: Applications in search and discovery; new perspectives and new approaches.

Pre-requisites:

Engagement in, or current plans for, data analytics, and perspectives or plans in application domains.

References:

F. Murtagh, Data Science Foundations: Geometry and Topology of Complex Hierarchic Systems and Big Data Analytics, Chapman & Hall, CRC Press, 2017. In the course material, relevant references will be included.

Short Bio

Fionn Murtagh is Professor of Data Science and was Professor of Computer Science, including Department Head, in many universities. Following his primary degrees in Mathematics and Engineering Science, and before his MSc in Computer Science (in Information Retrieval) at Trinity College Dublin, his first position, as Statistician/Programmer, was in national-level (first and second level) education research. His PhD at Université P&M Curie, Paris 6, with Prof. Jean-Paul Benzécri, was in conjunction with the national geological research centre, BRGM. After an initial 4 years as lecturer in computer science, he spent a period in atomic reactor safety at the European Joint Research Centre in Ispra (VA), Italy. As a European Space Agency Senior Scientist working on the Hubble Space Telescope, Fionn was based at the European Southern Observatory in Garching, Munich, for 12 years. For 5 years, Fionn was a Director in Science Foundation Ireland, managing mathematics and computing, nanotechnology, and introducing and growing all that is related to environmental science and renewable energy.
Fionn was Editor-in-Chief of the Computer Journal (British Computer Society) for more than 10 years, and is an Editorial Board member of many journals. He has authored or edited over 300 refereed articles and 30 books. He is a Fellow of the British Computer Society (FBCS), the Institute of Mathematics and Its Applications (FIMA), the International Association for Pattern Recognition (FIAPR), the Royal Statistical Society (FRSS), and the Royal Society of Arts (FRSA); an elected Member of the Royal Irish Academy (MRIA) and Academia Europaea (MAE); and a Senior Member of the IEEE.
Website: http://www.fmurtagh.info








Raymond Ng   
Professor of Computer Science at the University of British Columbia
Mining and Summarizing Text Conversations [introductory]

Summary:

With the ever-increasing popularity of Internet technologies and communication devices such as smartphones and tablets, and with huge amounts of conversational data generated on an hourly basis, intelligent text analytic approaches can greatly benefit organizations and individuals. For example, managers can find the information exchanged in forum discussions crucial for decision making; clinicians can use patients’ discussions to assist in chronic disease management.
In this lecture, we first give an overview of important applications of mining text conversations, using clinical applications and sentiment summarization of product reviews as case studies. Then we examine three topics in this area: (i) topic modeling; (ii) natural language summarization; and (iii) extraction of rhetorical structure and relationships in text.

Syllabus:

Pre-requisites:

Basic knowledge of machine learning and natural language processing is preferred but not required.

References:

Short Bio

Raymond Ng is a Professor of Computer Science (Canada Research Chair in Data Science and Analytics; Chief Informatics Officer, PROOF). His main research area for the past two decades has been data mining, with a specific focus on health informatics and text mining. He has published over 200 peer-reviewed publications on data clustering, outlier detection, OLAP processing, health informatics and text mining. He is the recipient of two best paper awards – from the 2001 ACM SIGKDD conference, the premier data mining conference in the world, and the 2005 ACM SIGMOD conference, one of the top database conferences worldwide. For the past decade, he has co-led several large-scale genomic projects funded by Genome Canada, Genome BC and industrial collaborators. Since the inception of the PROOF Centre of Excellence, which focuses on biomarker development for end-stage organ failures, he has held the position of the Chief Informatics Officer of the Centre. From 2009 to 2014, Dr. Ng was the associate director of the NSERC-funded strategic network on business intelligence.








Hanan Samet   
Center for Automation Research. Institute for Advanced Computer Studies. University of Maryland
Sorting in Space: Multidimensional, Spatial, and Metric Data Structures for Applications in Spatial Databases, Geographic Information Systems (GIS), and Location-based Services [introductory/intermediate]

Summary:

The representation of multidimensional, spatial, and metric data is an important issue in applications of spatial databases, geographic information systems (GIS), and location-based services. Recently, there has been much interest in hierarchical data structures such as quadtrees, octrees, and pyramids, which are based on image hierarchies, as well as methods that make use of bounding boxes, which are based on object hierarchies. Their key advantage is that they provide a way to index into space. In fact, they are little more than multidimensional sorts. They are compact and, depending on the nature of the spatial data, they save space as well as time and also facilitate operations such as search. We describe hierarchical representations of points, lines, collections of small rectangles, regions, surfaces, and volumes. For region data, we point out the dimension-reduction property of the region quadtree and octree. We also demonstrate how to use them for both raster and vector data. For metric data that does not lie in a vector space, so that indexing is based simply on the distance between objects, we review various representations such as the vp-tree, gh-tree, and mb-tree. In particular, we demonstrate the close relationship between these representations and those designed for a vector space. For all of the representations, we show how they can be used to compute nearest objects in an incremental fashion so that the number of objects need not be known in advance. The VASCO JAVA applet, which illustrates these methods, is presented (found at http://www.cs.umd.edu/~hjs/quadtree/index.html). They are also used in applications such as the SAND Internet Browser (found at http://www.cs.umd.edu/~brabec/sandjava). The above has been in the context of the traditional geometric representation of spatial data, while in the final part we review the more recent textual representation which is used in location-based services where the key issue is that of resolving ambiguities.
For example, does ``London'' correspond to the name of a person or a location, and if it corresponds to a location, which of the over 700 different instances of ``London'' is it? The NewsStand system at newsstand.umiacs.umd.edu and the TwitterStand system at TwitterStand.umiacs.umd.edu are examples. See also the cover article of the October 2014 issue of Communications of the ACM at http://tinyurl.com/newsstand-cacm or a cached version at http://www.cs.umd.edu/~hjs/pubs/cacm-newsstand.pdf and the accompanying video at https://vimeo.com/106352925
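
The remark that these structures are "little more than multidimensional sorts" can be illustrated with a small sketch. It is not taken from the course material: it assumes two-dimensional points with non-negative integer coordinates and uses a Z-order (Morton) key, one common space ordering, and the function names are hypothetical. Sorting by this key visits the four quadrants of each block in a fixed order, much as a traversal of a region quadtree would.

```python
# Illustrative sketch only: Z-order (Morton) keys as one way of "sorting in space".
# The helper names below are hypothetical and not part of any course software.

def interleave_bits(x: int, y: int, bits: int = 16) -> int:
    """Interleave the low `bits` bits of x and y into a single Morton code."""
    code = 0
    for i in range(bits):
        code |= ((x >> i) & 1) << (2 * i)       # bits of x go to even positions
        code |= ((y >> i) & 1) << (2 * i + 1)   # bits of y go to odd positions
    return code


def sort_in_space(points):
    """Return the points ordered by their Morton code."""
    return sorted(points, key=lambda p: interleave_bits(p[0], p[1]))


points = [(3, 3), (0, 0), (1, 0), (0, 1), (2, 2), (1, 1)]
ordered = sort_in_space(points)
print(ordered)
```

Points that fall in the same quadrant end up adjacent in the sorted sequence, which is why a one-dimensional index over such keys can serve as a simple spatial index.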

Syllabus:

1. Introduction
a. Sample queries
b. Spatial Indexing
c. Sorting approach
d. Minimum bounding rectangles (e.g., R-tree)
e. Disjoint cells (e.g., R+-tree, k-d-B-tree)
f. Uniform grid
g. Location-based queries vs. feature-based queries
h. Region quadtree
i. Dimension reduction
j. Pyramid
k. Region quadtrees vs. pyramids
l. Space ordering methods

2. Points
a. point quadtree
b. MX quadtree
c. PR quadtree
d. k-d tree
e. Bintree
f. BSP tree

3. Lines
a. Strip tree
b. PM1 quadtree
c. PM2 quadtree
d. PM3 quadtree
e. PMR quadtree

4. Rectangles and arbitrary objects
a. MX-CIF quadtree
b. Loose quadtree
c. Partition fieldtree
d. R-tree

5. Surfaces and Volumes
a. Restricted quadtree
b. Region octree
c. PM octree

6. Metric Data
a. vp-tree
b. gh-tree
c. mb-tree

7. Operations
a. Incremental nearest object location
b. Boolean set operations

8. Spatial Database Issues
a. General issues
b. Specific issues

9. Indexing spatiotextual data for location-based services delivered on platforms such as smart phones and tablets
a. Incorporation of spatial synonyms in search engines
b. Toponym recognition
c. Toponym resolution
d. Spatial reader scope
e. Incorporation of spatiotemporal data
f. System integration issues
g. Demos of live systems on smart phones

10. Example systems
a. SAND internet browser
b. JAVA spatial data applets
c. STEWARD
d. NewsStand
e. TwitterStand

Pre-requisites:

Practitioners working in the areas of big spatial data and spatial data science that involve spatial databases, geographic information systems, and location-based services will be given a different perspective on data structures found to be useful in most applications. Familiarity with computer terminology and some programming experience is needed to follow this course.

References:

1. H. Samet. ``Foundations of Multidimensional and Metric Data Structures.'' Morgan-Kaufmann, San Francisco, 2006.
2. H. Samet. ``A sorting approach to indexing spatial data.'' International Journal of Shape Modeling. 14(1):15--37, June 2008.
3. G. R. Hjaltason and H. Samet. ``Index-driven similarity search in metric spaces.'' ACM Transactions on Database Systems, 28(4):517--580, December 2003.
4. G. R. Hjaltason and H. Samet. ``Distance browsing in spatial databases.'' ACM Transactions on Database Systems, 24(2):265--318, June 1999. Also Computer Science TR-3919, University of Maryland, College Park, MD.
5. G. R. Hjaltason and H. Samet. ``Ranking in spatial databases.'' In Advances in Spatial Databases --- 4th International Symposium, SSD'95, M. J. Egenhofer and J. R. Herring, eds., Portland, ME, August 1995, 83--95. Also Springer-Verlag Lecture Notes in Computer Science 951.
6. H. Samet. ``Applications of Spatial Data Structures: Computer Graphics, Image Processing, and GIS.'' Addison-Wesley, Reading, MA, 1990.
7. H. Samet. ``The Design and Analysis of Spatial Data Structures.'' Addison-Wesley, Reading, MA, 1990.
8. C. Esperanca and H. Samet. ``Experience with SAND/Tcl: a scripting tool for spatial databases.'' Journal of Visual Languages and Computing, 13(2):229--255, April 2002.
9. H. Samet, H. Alborzi, F. Brabec, C. Esperanca, G. R. Hjaltason, F. Morgan, and E. Tanin. ``Use of the SAND spatial browser for digital government applications.'' Communications of the ACM, 46(1):63--66, January 2003.
10. B. Teitler, M. D. Lieberman, D. Panozzo, J. Sankaranarayanan, H. Samet, and J. Sperling. ``NewsStand: A new view on news.'' Proceedings of the 16th ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems, Irvine, CA, November 2008, 144--153.
11. H. Samet, J. Sankaranarayanan, M. D. Lieberman, M. D. Adelfio, B. C. Fruin, J. M. Lotkowski, D. Panozzo, J. Sperling, and B. E. Teitler. ``Reading news with maps by exploiting spatial synonyms.'' Communications of the ACM, 57(10):64--77, October 2014.
12. J. Sankaranarayanan, H. Samet, B. Teitler, M. D. Lieberman, and J. Sperling. ``TwitterStand: News in tweets.'' Proceedings of the 17th ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems, Seattle, WA, November 2009, 42--51.
13. M. D. Lieberman, H. Samet, and J. Sankaranarayanan. ``Geotagging with local lexicons to build indexes for textually-specified spatial data.'' Proceedings of the 26th IEEE International Conference on Data Engineering, Long Beach, CA, March 2010, 201--212.
14. M. D. Lieberman and H. Samet. ``Multifaceted Toponym Recognition for Streaming News.'' Proceedings of the ACM SIGIR Conference. Beijing, July 2011, 843--852.
15. M. D. Lieberman and H. Samet. ``Adaptive Context Features for Toponym Resolution in Streaming News.'' Proceedings of the ACM SIGIR Conference. Portland, OR, August 2012, 731--740.
16. M. D. Lieberman and H. Samet. ``Supporting Rapid Processing and Interactive Map-Based Exploration of Streaming News.'' Proceedings of the 20th ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems. Redondo Beach, CA, November 2012, 179--188.
17. Spatial Data Structure applets at http://www.cs.umd.edu/~hjs/quadtree/index.html.

Short Bio

Hanan Samet (http://www.cs.umd.edu/~hjs/) is a Distinguished University Professor of Computer Science at the University of Maryland, College Park and is a member of the Institute for Advanced Computer Studies. He is also a member of the Computer Vision Laboratory at the Center for Automation Research where he leads a number of research projects on the use of hierarchical data structures for database applications, geographic information systems, computer graphics, computer vision, image processing, games, robotics, and search. He received the B.S. degree in engineering from UCLA, and the M.S. degree in operations research and the M.S. and Ph.D. degrees in computer science from Stanford University. His doctoral dissertation dealt with proving the correctness of translations of LISP programs, which was the first work in translation validation and the related concept of proof-carrying code. He is the author of the recent book ``Foundations of Multidimensional and Metric Data Structures'' (http://www.cs.umd.edu/~hjs/multidimensional-book-flyer.pdf) published by Morgan-Kaufmann, an imprint of Elsevier, in 2006, an award winner in the 2006 best book in Computer and Information Science competition of the Professional and Scholarly Publishers (PSP) Group of the Association of American Publishers (AAP), and of the first two books on spatial data structures ``Design and Analysis of Spatial Data Structures'', and ``Applications of Spatial Data Structures: Computer Graphics, Image Processing, and GIS,'' both published by Addison-Wesley in 1990.
He is the Founding Editor-In-Chief of the ACM Transactions on Spatial Algorithms and Systems (TSAS), the founding chair of ACM SIGSPATIAL, a recipient of a Science Foundation Ireland (SFI) Walton Visitor Award at the Centre for Geocomputation at the National University of Ireland at Maynooth (NUIM), the 2009 UCGIS Research Award, the 2010 CMPS Board of Visitors Award at the University of Maryland, the 2011 ACM Paris Kanellakis Theory and Practice Award, and the 2014 IEEE Computer Society Wallace McDowell Award, and a Fellow of the ACM, IEEE, AAAS, IAPR (International Association for Pattern Recognition), and UCGIS (University Consortium for Geographic Information Science). He received best paper awards in the 2007 Computers & Graphics Journal, the 2008 ACM SIGMOD and SIGSPATIAL ACMGIS Conferences, the 2012 SIGSPATIAL MobiGIS Workshop, and the 2013 SIGSPATIAL GIR Workshop, as well as a best demo award at the 2011 SIGSPATIAL ACMGIS'11 Conference. His paper at the 2009 IEEE International Conference on Data Engineering (ICDE) was selected as one of the best papers for publication in the IEEE Transactions on Knowledge and Data Engineering. He was elected to the ACM Council as the Capitol Region Representative for the term 1989-1991, and is an ACM Distinguished Speaker.








Kyuseok Shim   
Professor, Department of Electrical and Computer Engineering, Seoul National University, Korea
MapReduce Algorithms for Big Data Analysis [introductory/intermediate]

Summary:

A growing number of applications must handle big data, yet analyzing big data remains very challenging. For such applications, the MapReduce framework has recently attracted a lot of attention. MapReduce is a programming model that allows easy development of scalable parallel applications to process big data on large clusters of commodity machines. Google’s MapReduce or its open-source equivalent Hadoop is a powerful tool for building such applications. In this tutorial, I will first introduce the MapReduce framework based on the Hadoop system, which is available to everyone for running distributed computing algorithms using MapReduce. I will next discuss how to design efficient MapReduce algorithms and present the state of the art in MapReduce algorithms for big data analysis. Since Spark was recently developed to overcome the shortcomings of MapReduce, which is not optimized for iterative algorithms and interactive data analysis, I will also present an outline of Spark as well as the differences between MapReduce and Spark. The intended audience of this tutorial is professionals who plan to develop efficient MapReduce algorithms and researchers who should be aware of the state of the art in MapReduce algorithms available today for big data analysis.
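
As a rough illustration of the programming model described above, the following sketch simulates the map, shuffle, and reduce steps of a word count in a single Python process. It is an assumption-laden toy, not Hadoop code: the function names are hypothetical, and a real framework would distribute the map and reduce calls across machines and perform the grouping itself.

```python
# Single-process simulation of the MapReduce model; all names are illustrative.
from collections import defaultdict


def map_phase(document: str):
    """Map: emit a (word, 1) pair for every word in one input record."""
    for word in document.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group all emitted values by key, as the framework would."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine the values of one key into a single count."""
    return key, sum(values)


def word_count(documents):
    pairs = [pair for doc in documents for pair in map_phase(doc)]
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


counts = word_count(["big data on big clusters", "data on commodity machines"])
print(counts)
```

The programmer writes only the map and reduce functions; the grouping step in between is what lets the framework run many map and reduce tasks in parallel.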

Syllabus:

Introduction to Hadoop and MapReduce
- Why parallel computing for big data analysis?
- Introduction to MapReduce
- Hadoop distributed file systems
- Word counting, inverted index building, page rank algorithms

MapReduce Algorithms for Database Systems
- Theta joins
- Similarity joins
- K-nearest neighbor joins
- Skyline computations
- Interval joins
- Subgraph enumeration
- Triangle counting
- Wavelet computation

MapReduce Algorithms for Data Mining
- K-means clustering
- EM, PLSI and LDA clustering
- Density-based clustering
- Association rule mining
- Sequential pattern mining

Introduction to Spark

Summary

Pre-requisites:

References:

Short Bio

Kyuseok Shim is currently a professor in the Department of Electrical and Computer Engineering at Seoul National University, Korea. Before that, he was an assistant professor in the Department of Computer Science at KAIST and a member of technical staff for the Serendip Data Mining Project at Bell Laboratories. He was also a member of the Quest Data Mining Project at the IBM Almaden Research Center and visited Microsoft Research at Redmond several times as a visiting scientist. In 2013, Kyuseok was named an ACM Fellow for his contributions to scalable data mining and query processing research. Kyuseok has been working in the area of databases focusing on data mining, search engines, recommendation systems, MapReduce algorithms, privacy preservation, query processing and query optimization. His writings have appeared in a number of professional conferences and journals including ACM, VLDB and IEEE publications. He served as a Program Committee member for SIGKDD, SIGMOD, ICDE, ICDM, ICDT, EDBT, PAKDD, VLDB and WWW conferences. He also served as a Program Committee Co-Chair for PAKDD 2003, WWW 2014, ICDE 2015 and APWeb 2016. Kyuseok was previously on the editorial board of VLDB as well as IEEE TKDE Journals and is currently a member of the VLDB Endowment Board of Trustees. He received the BS degree in electrical engineering from Seoul National University in 1986, and the MS and PhD degrees in computer science from the University of Maryland, College Park, in 1988 and 1993, respectively.








Jeffrey Ullman   
Stanford W. Ascherman Professor of Computer Science (Emeritus)
Big-data Algorithms That Aren't Machine Learning [introductory]

Summary:

We shall study algorithms that have been found useful in querying large data volumes. The emphasis is on algorithms that cannot be considered 'machine learning'.

Syllabus:

Pre-requisites:

A course in algorithms at the advanced-undergraduate level is important. A course in database systems is helpful, but not required.

References:

We will be covering (parts of) Chapters 3, 4, 5, and 10 of the free text Mining of Massive Datasets, by Jure Leskovec, Anand Rajaraman, and Jeff Ullman, available at www.mmds.org

Short Bio

Link to the bio








Sebastián Ventura   
Professor of Computer Sciences and Artificial Intelligence in the University of Córdoba
Pattern Mining on Big Data [intermediate/advanced]

Summary:

Data analysis is of growing interest in many fields; it is concerned with the development of methods and techniques for making sense of data. Hence, there is a real incentive to collect, manage and transform raw data into significant and meaningful information that may be used for subsequent analysis that leads to better decision making. When talking about data analysis, the key element is the pattern, which is used to represent any type of homogeneity and regularity in data, serving as a way of describing intrinsic and important properties of data. Pattern mining, however, is a really challenging task that requires deep study, especially on massive and complex data where the computational and memory requirements are too high. Early exhaustive search approaches in this field were improved by adding constraints to the mining process so that the search space could be heavily reduced. These constraints helped the user’s exploration and control, confining the space of solutions to those of interest. Even so, the extraction of patterns from huge datasets still required a large amount of memory, since the number of feasible patterns increases exponentially with the number of items in the data. Hence, different ways of solving this arduous task were proposed, with metaheuristics being a good option to avoid analyzing the whole search space. Nevertheless, approaches based on metaheuristics are time-consuming for extremely large datasets, since every pattern is evaluated on every transaction. In this sense, novel data structures as well as parallel pattern mining methods have recently emerged as really interesting and promising research areas. Parallel processing is, perhaps, the principal research topic (in connection with the runtime) considered by the pattern mining community. In this regard, two main directions are being studied: (1) clusters of computers and (2) graphics processing units (GPUs).
GPUs, for example, have been successfully applied by analyzing each transaction in parallel so the runtime is reduced. MapReduce, by contrast, decomposes the problem into two phases: map and reduce. The input dataset is split into subsets so the map phase produces all the patterns within each of these subsets, assigning as a value the frequency of each pattern. Then, similar patterns are merged so the reduce phase is able to work on these sets to produce the final frequencies. MapReduce is one of the most widely studied emerging paradigms for intensive computing, achieving excellent results in a simple and robust way. However, recent research studies have demonstrated that these approaches are just recommended for really Big Data since the time required to load the parallel structure is even larger than the one required to.
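
The two-phase decomposition described above can be sketched as follows. This is a single-process toy written under stated assumptions, not any of the algorithms from the references: patterns are taken to be itemsets of at most `max_size` items, the function names are hypothetical, and a real deployment would run each map call on a different node.

```python
# Toy simulation of MapReduce-style pattern counting; all names are illustrative.
from collections import Counter
from itertools import combinations


def map_split(transactions, max_size=2):
    """Map: count every itemset (up to max_size items) found in one data split."""
    local = Counter()
    for transaction in transactions:
        items = sorted(set(transaction))  # canonical order so equal itemsets match
        for size in range(1, max_size + 1):
            for itemset in combinations(items, size):
                local[itemset] += 1
    return local


def reduce_counts(partials):
    """Reduce: merge the per-split counts of identical patterns."""
    total = Counter()
    for partial in partials:
        total.update(partial)  # Counter.update adds counts rather than replacing
    return total


def frequent_patterns(splits, min_support, max_size=2):
    totals = reduce_counts(map_split(s, max_size) for s in splits)
    return {p: c for p, c in totals.items() if c >= min_support}


splits = [
    [["bread", "milk"], ["bread", "beer"]],
    [["bread", "milk", "beer"], ["milk"]],
]
frequent = frequent_patterns(splits, min_support=2)
print(frequent)
```

Enumerating every itemset in every split is exactly the cost that grows exponentially with the number of items, which is why the constraints, metaheuristics, and specialised data structures mentioned above matter in practice.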

Syllabus:

-Pattern mining: foundations and algorithms (time and memory requirements)
-Evolutionary algorithms for mining patterns (reducing the requirements)
-Data structure to reduce the evaluation process
-Parallel solutions:
a) based on GPUs
b) based on MapReduce

Pre-requisites:

Foundations of Pattern Mining: classical (exhaustive) approaches; Foundations of Evolutionary Computation

References:

Basic References:

Charu C. Aggarwal. Data Mining: The Textbook. 1st Edition. Springer (2015). ISBN 978-3-319-14141-1.
Charu C. Aggarwal and Jiawei Han. Frequent Pattern Mining. 1st Edition. Springer (2014). ISBN 978-3-319-07820-5.
Sebastián Ventura, José María Luna: Pattern Mining with Evolutionary Algorithms. 1st Edition, Springer (2016), ISBN 978-3-319-33857-6.

Supplementary references

José María Luna, José Raúl Romero, Sebastián Ventura: Design and behavior study of a grammar-guided genetic programming algorithm for mining association rules. Knowl. Inf. Syst. 32(1): 53-76 (2012)
Alberto Cano, José María Luna, Sebastián Ventura: High performance evaluation of evolutionary-mined association rules on GPUs. The Journal of Supercomputing 66(3): 1438-1461 (2013)
José María Luna, Alberto Cano, Mykola Pechenizkiy, Sebastián Ventura: Speeding-Up Association Rule Mining With Inverted Index Compression. IEEE Trans. Cybernetics 46(12): 3059-3072 (2016)
José María Luna, Francisco Padillo, Mykola Pechenizkiy, Sebastián Ventura: Apriori Versions Based on MapReduce for Mining Frequent Patterns on Big Data. IEEE Trans. Cybernetics (2017). DOI: 10.1109/TCYB.2017.2751081

Short Bio








Xiaowei Xu   
Professor. Department of Information Science. University of Arkansas at Little Rock
Mining Big Networked Data [introductory/advanced]

Summary:

Recent explosive growth of online social networks such as Facebook and Twitter provides a unique opportunity for many data mining applications including real time event detection, community structure detection and viral marketing. The course covers big data analytics for social networks. The emphasis will be on scalable algorithms for community structure detection, social tie modeling and structural pattern mining for big networks.

Syllabus:

Modularity-based community structure detection algorithms [1]
Structural clustering algorithms [2]
Label propagation algorithms [3]
Social tie modeling [4]
Parallel network clustering algorithm [5]
Discovering multiple social ties for characterization of individuals in online social networks [6]
Anytime network clustering algorithm for very big networks [7]

Pre-requisites:

Basic knowledge in computer algorithms and graph theory.

References:

1. A. Clauset, M. E. J. Newman, and C. Moore. Finding community structure in very large networks. Phys. Rev. E 70, 066111 (2004).
2. X. Xu, N. Yuruk, Z. Feng, and T. A. Schweiger. SCAN: a structural clustering algorithm for networks. In Proceedings of the 13th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 824–833. ACM, 2007.
3. U. N. Raghavan, R. Albert, and S. Kumara. Near linear time algorithm to detect community structures in large-scale networks. Phys. Rev. E 76, 036106 (2007).
4. S. Sintos and P. Tsaparas. Using strong triadic closure to characterize ties in social networks. In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1466–1475. ACM, 2014.
5. W. Zhao, V. S. Martha, and X. Xu. PSCAN: A Parallel Structural Clustering Algorithm for Big Networks in MapReduce. AINA 2013, 862-869.
6. Ming-Hua Chung, Gang Chen, Weizhong Zhao, Guohua Hao, Julian Pan, and Xiaowei Xu. Discovering Multiple Social Ties for Characterization of Individuals in Online Social Networks. The Third European Network Intelligence Conference (ENIC 2016), September 5-7, 2016, 1-8. 10.1109/ Wrocław, Poland.
7. Weizhong Zhao, Gang Chen, Xiaowei Xu. AnySCAN: An Efficient Anytime Framework with Active Learning for Large-scale Network Clustering. Proceedings of IEEE International Conference on Data Mining (ICDM 2017), New Orleans, November 18-21, 2017.

Short Bio

Xiaowei Xu is a professor in the Department of Information Science at the University of Arkansas at Little Rock (UALR). He received his Ph.D. in computer science from the University of Munich in 1998. Prior to his appointment at UALR, Dr. Xu was a senior research scientist in Siemens Corporate Technology. Dr. Xu is an adjunct professor in the Department of Mathematics at the University of Arkansas. Dr. Xu was an Oak Ridge Institute for Science and Education (ORISE) Faculty Research Program Member in the National Center for Toxicological Research's (NCTR) Center for Bioinformatics in the Division of Systems Biology from 2010 to 2014. He is also a consultant for companies including Siemens, Acxiom, Dataminr and L’Oreal. Dr. Xu’s research focuses on algorithms for data mining and machine learning. Dr. Xu is a recipient of the 2014 ACM SIGKDD Test of Time Award for his work on the density-based clustering algorithm DBSCAN, which has received over 10,000 citations according to Google Scholar. Dr. Xu has served as a program committee member and session chair for premier forums including the ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD) and the IEEE International Conference on Data Mining (ICDM).








Zhongfei Zhang   
Professor, Department of Computer Science, Watson School of Engineering and Applied Sciences. Binghamton University
Relational and Media Data Learning and Knowledge Discovery [introductory/advanced]

Summary:

This course aims to give the audience a complete introduction to knowledge discovery and machine learning theories, together with case studies in real-world applications, for relational and media data. The course begins with an extensive introduction to the fundamental concepts and theories of knowledge discovery and machine learning for relational and media data, and then showcases several important real-world applications as case studies of big data knowledge discovery and learning.

Syllabus:

The course consists of three two-hour sessions. The syllabus is as follows:
First session: Introduction to the fundamental concepts and theories for relational and media data with the specific foci on an overview of the wide spectrum of techniques and technologies available as well as their relationships and applications to big data scenarios through real-world case studies;
Second session: Specific discussions on the classic and state-of-the-art methods for relational data knowledge discovery and learning;
Third session: Specific discussions on the state-of-the-art methods on media data knowledge discovery.

Pre-requisites:

College math, fundamentals of computer science

References:

1. Bo Long, Zhongfei (Mark) Zhang, and Philip S. Yu, Relational Data Clustering: Models, Algorithms, and Applications, Taylor & Francis/CRC Press, 2010, ISBN: 9781420072617
2. Zhongfei (Mark) Zhang and Ruofei Zhang, Multimedia Data Mining -- A Systematic Introduction to Concepts and Theory, Taylor & Francis Group/CRC Press, 2008, ISBN: 9781584889663
3. Zhongfei (Mark) Zhang, Bo Long, Zhen Guo, Tianbing Xu, and Philip S. Yu, Machine Learning Approaches to Link-Based Clustering, in Link Mining: Models, Algorithms and Applications, Edited by Philip S. Yu, Christos Faloutsos, and Jiawei Han, Springer, 2010
4. Zhen Guo, Zhongfei Zhang, Eric P. Xing, and Christos Faloutsos, Multimodal Data Mining in a Multimedia Database Based on Structured Max Margin Learning, ACM Transactions on Knowledge Discovery and Data Mining, ACM Press, 2015
5. http://www.cs.binghamton.edu/~forweb/publicationsactive.html

Short Bio

Zhongfei (Mark) Zhang is a full professor of Computer Science at State University of New York (SUNY) at Binghamton, and directs the Multimedia Research Computing Laboratory at the University. He has also served as a QiuShi Chair Professor at Zhejiang University, China, and as the Director of the Data Science and Engineering Research Center at that university while he was on leave from State University of New York (SUNY) at Binghamton, USA. He received a B.S. in Electronics Engineering (with Honors) and an M.S. in Information Sciences, both from Zhejiang University, China, and a PhD in Computer Science from the University of Massachusetts at Amherst, USA. His research interests include knowledge discovery and machine learning for media and relational data, multimedia information indexing and retrieval, artificial intelligence, computer vision, and pattern recognition. He is the author and co-author of the first monograph on multimedia data mining and the first monograph on relational data clustering, respectively. His research is sponsored by a wide spectrum of government funding agencies, industrial labs, and private agencies, notably including US NSF, US AFRL, CNRS in France, JSPS in Japan, MOST and NSFC in China, New York State Government in the US, and Zhejiang Provincial Government in China, as well as Kodak Research and Microsoft Research in the US, Alibaba Group in China, and Huang Kuancheng Foundation in Hong Kong, China. He has published over 200 papers in premier venues in his areas and is an inventor on more than 30 patents. He has served on several journal editorial boards and received several professional awards.