Informal Seminar Series

2010 Seminars - All Seminars held at 4:00 p.m. in DC 2306C, the AI Lab unless specified

Wednesday, January 20, 2010
Title:	No Free Lunch in Machine Learning
Speaker:	Shai Ben-David
Abstract:	One of the most fundamental insights of machine learning is the No Free Lunch (NFL) principle, stating that no learning is possible without the application of prior domain knowledge. In spite of the crisp and basic nature of this principle, many seem to be ignorant of it and of its implications. On the application side, we come across doctors that expect machine learning to deliver medical insights based on just raw archives of patient files, data base people that expect clustering to detect duplicate records from input consisting of just a collection of records and the list goes on and on. Even more bothersome is the recent emergence of CS professionals that promise (either implicitly or explicitly) "universal" machine learning tools that can "learn any needed the prior knowledge autonomously". Two upcoming distinguished talks here in Waterloo (Hinton's talk at the CS club on Jan 26th, and Lipson's Perimeter Institute February Public Talk http://www.perimeterinstitute.ca/en/Outreach/Public_Lectures/Public_Lectures/) are relevant in this context. In my talk, I will explain the NFL principles and discuss its applications in several established as well as emerging ML paradigms (including why it rules out the existence of universal learning tools).

Wednesday, January 20, 2010

Title:

No Free Lunch in Machine Learning

Speaker:

Shai Ben-David

Abstract:

One of the most fundamental insights of machine learning is the No Free Lunch (NFL) principle, stating that no learning is possible without the application of prior domain knowledge.

In spite of the crisp and basic nature of this principle, many seem to be ignorant of it and of its implications. On the application side, we come across doctors that expect machine learning to deliver medical insights based on just raw archives of patient files, data base people that expect clustering to detect duplicate records from input consisting of just a collection of records and the list goes on and on. Even more bothersome is the recent emergence of CS professionals that promise (either implicitly or explicitly) "universal" machine learning tools that can "learn any needed the prior knowledge autonomously". Two upcoming distinguished talks here in Waterloo (Hinton's talk at the CS club on Jan 26th, and Lipson's Perimeter Institute February Public Talk http://www.perimeterinstitute.ca/en/Outreach/Public_Lectures/Public_Lectures/)
are relevant in this context.

In my talk, I will explain the NFL principles and discuss its applications in several established as well as emerging ML paradigms (including why it rules out the existence of universal learning tools).

Wednesday, January 27, 2010, MC 6007, 4:30 p.m.
Title:	Bayesian Language Modelling
Speaker:	Pascal Poupart
Abstract:	Recent years have seen an increased use of Bayesian techniques for language modelling due to their flexibility, their ability to model various syntactic and semantic sructure and the possibility to include prior knowledge. In this talk, I will first review existing techniques for topic modelling such as probabilistic latent semantic analysis, latent Dirichlet allocation and hierarchical Dirichlet processes. I will also review co-location techniques such as n-gram models and the hierarchical Pitman-Yor process. I will then discuss how topic models and co-location models can be combined to yield powerful language models that simultaneously capture some elements of syntax and semantics. I will show the benefits of such models to extract short synthesizing phrases to label clusters of documents in an unsupervised fashion. If time permits, I will also discuss some of the challenges to extend these models to deal with entities and relations that are often thoughts as the pillars of deep natural language understanding. I may also discuss some of the challenges to learn these models in an online fashion by sequential monte carlo techniques. This work was done in collaboration with Ting Liu (UW), Ruitong Huang (UW), Andy Chiu (Google) and Finnegan Southey (Google) This work is supported financially by a Google Research Award

Wednesday, January 27, 2010, MC 6007, 4:30 p.m.

Title:

Bayesian Language Modelling

Speaker:

Pascal Poupart

Abstract:

Recent years have seen an increased use of Bayesian techniques for language modelling due to their flexibility, their ability to model various syntactic and semantic sructure and the possibility to include prior knowledge. In this talk, I will first review existing techniques for topic modelling such as probabilistic latent semantic analysis, latent Dirichlet allocation and hierarchical Dirichlet processes. I will also review co-location techniques such as n-gram models and the hierarchical Pitman-Yor process. I will then discuss how topic models and co-location models can be combined to yield powerful language models that simultaneously capture some elements of syntax and semantics. I will show the benefits of such models to extract short synthesizing phrases to label clusters of documents in an unsupervised fashion. If time permits, I will also discuss some of the challenges to extend these models to deal with entities and relations that are often thoughts as the pillars of deep natural language understanding. I may also discuss some of the challenges to learn these models in an online fashion by sequential monte carlo techniques.

This work was done in collaboration with Ting Liu (UW), Ruitong Huang (UW), Andy Chiu (Google) and Finnegan Southey (Google)

This work is supported financially by a Google Research Award

Wednesday, February 3, 2010
Title:	Bayesian Language Modelling (part II)
Speaker:	Pascal Poupart
Abstract:	Recent years have seen an increased use of Bayesian techniques for language modelling due to their flexibility, their ability to model various syntactic and semantic sructure and the possibility to include prior knowledge. In this talk, I will first review existing techniques for topic modelling such as probabilistic latent semantic analysis, latent Dirichlet allocation and hierarchical Dirichlet processes. I will also review co-location techniques such as n-gram models and the hierarchical Pitman-Yor process. I will then discuss how topic models and co-location models can be combined to yield powerful language models that simultaneously capture some elements of syntax and semantics. I will show the benefits of such models to extract short synthesizing phrases to label clusters of documents in an unsupervised fashion. If time permits, I will also discuss some of the challenges to extend these models to deal with entities and relations that are often thoughts as the pillars of deep natural language understanding. I may also discuss some of the challenges to learn these models in an online fashion by sequential monte carlo techniques. This work was done in collaboration with Ting Liu (UW), Ruitong Huang (UW), Andy Chiu (Google) and Finnegan Southey (Google) This work is supported financially by a Google Research Award

Wednesday, February 3, 2010

Title:

Bayesian Language Modelling (part II)

Speaker:

Pascal Poupart

Abstract:

This work was done in collaboration with Ting Liu (UW), Ruitong Huang (UW), Andy Chiu (Google) and Finnegan Southey (Google)

This work is supported financially by a Google Research Award

Wednesday, February 10, 2010, 2:30 p.m.
Title:	Learning a non-parametric mapping for Non-linear Dimensionality Reduction
Speaker:	Ali Ghodsi
Abstract:	The foremost nonlinear dimensionality reduction algorithms provide an embedding only for the given training data, with no straightforward extension for the test points. This shortcoming makes them unsuitable for problems such as classification and regression. On the other hand, linear dimensionality reduction algorithms are capable of handling the out-of-sample examples easily, but their effectiveness is limited by the linearity of the subspace they reveal. In this talk I propose a novel dimensionality reduction algorithm which learns a parametric mapping between the high-dimensional space and the embedded space. The key observation is that when the dimensionality of the data is greater than its quantity, it is always possible to find a linear transformation that preserves a given subset of distances, while changing the distances of another subset. We present a method that first maps the points into a high-dimensional feature space, and then explicitly searches for an affine transformation that preserves the local distances while pulling the non-neighbor points as far apart as possible. We formulate this search as an instance of semi-definite programming. The resulted transformation can then be used to map out-of-sample points into the embedded space. This is a joint work with Pooyan Khajehpour

Wednesday, February 10, 2010, 2:30 p.m.

Title:

Learning a non-parametric mapping for Non-linear Dimensionality Reduction

Speaker:

Ali Ghodsi

Abstract:

The foremost nonlinear dimensionality reduction algorithms provide an embedding only for the given training data, with no straightforward extension for the test points. This shortcoming makes them unsuitable for problems such as classification and regression. On the other hand, linear dimensionality reduction algorithms are capable of handling the out-of-sample examples easily, but their effectiveness is limited by the linearity of the subspace they reveal. In this talk I propose a novel dimensionality reduction algorithm which learns a parametric mapping between the high-dimensional space and the embedded space. The key observation is that when the dimensionality of the data is greater than its quantity, it is always possible to find a linear transformation that preserves a given subset of distances, while changing the distances of another subset. We present a method that first maps the points into a high-dimensional feature space, and then explicitly searches for an affine transformation that preserves the local distances while pulling the non-neighbor points as far apart as possible. We formulate this search as an instance of semi-definite programming. The resulted transformation can then be used to map out-of-sample points into the embedded space.

This is a joint work with Pooyan Khajehpour

Wednesday, February 24, 2010, 4:00 p.m.
Title:	Learning a non-parametric mapping for Non-linear Dimensionality Reduction
Speaker:	Ali Ghodsi (part II)
Abstract:	The foremost nonlinear dimensionality reduction algorithms provide an embedding only for the given training data, with no straightforward extension for the test points. This shortcoming makes them unsuitable for problems such as classification and regression. On the other hand, linear dimensionality reduction algorithms are capable of handling the out-of-sample examples easily, but their effectiveness is limited by the linearity of the subspace they reveal. In this talk I propose a novel dimensionality reduction algorithm which learns a parametric mapping between the high-dimensional space and the embedded space. The key observation is that when the dimensionality of the data is greater than its quantity, it is always possible to find a linear transformation that preserves a given subset of distances, while changing the distances of another subset. We present a method that first maps the points into a high-dimensional feature space, and then explicitly searches for an affine transformation that preserves the local distances while pulling the non-neighbor points as far apart as possible. We formulate this search as an instance of semi-definite programming. The resulted transformation can then be used to map out-of-sample points into the embedded space. This is a joint work with Pooyan Khajehpour

Wednesday, February 24, 2010, 4:00 p.m.

Title:

Learning a non-parametric mapping for Non-linear Dimensionality Reduction

Speaker:

Ali Ghodsi (part II)

Abstract:

This is a joint work with Pooyan Khajehpour

Wednesday, March 10, 2010, 2:30 p.m.
Title:	Deep kernel machines and stochastic stepwise ensembles
Speaker:	Mu Zhu
Abstract:	Kernel methods, such as the support vector machine, and ensemble methods, such as the AdaBoost algorithm, have defined two major domains of recent research in statistical machine learning. In this talk, I will describe two research projects, one in each domain. In one project, we try to classify network data using something that we call deep kernel machines (joint work in progress with Xiao Tang). In the other, we try to perform variable selection using a so-called stochastic stepwise ensemble (joint work in progress with Lu Xin).

Wednesday, March 24, 2010, 4:30 p.m., MC 2018 A/B
Title:	When Quantity makes Quality: Learning with Information Constraints
Speaker:	Ohad Shamir
Abstract:	In standard learning models, it is assumed that the learner has a complete and fully available training set at hand. However, in many real-world applications, obtaining full information on the training examples is expensive, illegal, or downright impossible. In this talk, I will discuss some new methods to learn in such information-constrained settings. These range from learning with only a few available features from each example; through coping with extremely noisy access to the data; to privacy-preserving learning. The underlying theme is that by gathering less information on more examples, one can be provably competitive with learning mechanisms which enjoy full access to the data. Along the way, I'll describe some novel techniques which might be of interest in their own right. The talk is based on some recent joint works with Nicolò Cesa-Bianchi and Shai Shalev-Shwartz.

Wednesday, March 24, 2010, 4:30 p.m., MC 2018 A/B

Title:

When Quantity makes Quality: Learning with Information Constraints

Speaker:

Ohad Shamir

Abstract:

In standard learning models, it is assumed that the learner has a complete and fully available training set at hand. However, in many real-world applications, obtaining full information on the training examples is expensive, illegal, or downright impossible. In this talk, I will discuss some new methods to learn in such information-constrained settings. These range from learning with only a few available features from each example; through coping with extremely noisy access to the data; to privacy-preserving learning. The underlying theme is that by gathering less information on more examples, one can be provably competitive with learning mechanisms which enjoy full access to the data. Along the way, I'll describe some novel techniques which might be of interest in their own right.

The talk is based on some recent joint works with Nicolò Cesa-Bianchi and Shai Shalev-Shwartz.

Wednesday, March 31, 2010, MC 6007, 2:30 p.m.
Title:	Visualizing High Dimensional Data: Applications of graph theory to statistical graphics
Speaker:	Wayne Oldford
Abstract:	In statistical data analysis, we are often looking for structure in high dimensional data. In classification problems, we are interested in how different known classes separate from (and relate to) one another in the data space of measured values. In clustering, we are hoping to discover distinct groups of points in this space. In model building, we are often interested in which data points agree/disagree with the conjectured model and whether important structure has been missed. And, … we hope to do all of this without prejudging the nature of the structure itself, even as far as to discover the unanticipated! In three or fewer dimensions, our visual system is an important asset, as much (even unanticipated) structure can be recognized effortlessly when points can be plotted so few dimensions. Unfortunately, even after formal dimension reduction methods have been applied, we are often faced with many more dimensions than three. In this talk, I will explore some visualization methods for high dimensional data. I will review and illustrate methods based on radial, parallel, and orthogonal coordinates. These three axis systems have different strengths and weaknesses. In all cases however, improvements may be had by casting the axis arrangement in a graph theoretic framework. I will explore the relevant graph theoretic representations and illustrate their use on real data sets. I will pay particular attention to the orthogonal axis system and show how graph traversal can be used to meaningfully navigate through high dimensional space. All software used is (or shortly will be) available as a package in the open source statistical system called R. This is based on joint work with Catherine Hurley of the National University of Ireland, Maynooth and Adrian Waddell of the University of Waterloo.

Wednesday, March 31, 2010, MC 6007, 2:30 p.m.

Title:

Visualizing High Dimensional Data: Applications of graph theory to statistical graphics

Speaker:

Wayne Oldford

Abstract:

In statistical data analysis, we are often looking for structure in high dimensional data. In classification problems, we are interested in how different known classes separate from (and relate to) one another in the data space of measured values. In clustering, we are hoping to discover distinct groups of points in this space. In model building, we are often interested in which data points agree/disagree with the conjectured model and whether important structure has been missed. And, … we hope to do all of this without prejudging the nature of the structure itself, even as far as to discover the unanticipated!

In three or fewer dimensions, our visual system is an important asset, as much (even unanticipated) structure can be recognized effortlessly when points can be plotted so few dimensions. Unfortunately, even after formal dimension reduction methods have been applied, we are often faced with many more dimensions than three.

In this talk, I will explore some visualization methods for high dimensional data. I will review and illustrate methods based on radial, parallel, and orthogonal coordinates. These three axis systems have different strengths and weaknesses. In all cases however, improvements may be had by casting the axis arrangement in a graph theoretic framework. I will explore the relevant graph theoretic representations and illustrate their use on real data sets. I will pay particular attention to the orthogonal axis system and show how graph traversal can be used to meaningfully navigate through high dimensional space.

All software used is (or shortly will be) available as a package in the open source statistical system called R.

This is based on joint work with Catherine Hurley of the National University of Ireland, Maynooth and Adrian Waddell of the University of Waterloo.

Wednesday, April 14, 2010, 2:30 p.m.
Title:	An Introduction to Reinforcement Learning
Speaker:	Pascal Poupart
Abstract:	In this talk, I will introduce the topic of reinforcement learning and relate it to other areas of machine learning. While reinforcement learning was originally inspired by animal learning (i.e., how to train animals by reinforcements), today, it has evolved into one the most comprehensive and challenging form of machine learning. From a practical perspective, a wide variety of applications from robotic control and spoken dialogue management to game playing and ad online ad placement can be naturally modeled as reinforcement learning problems. From a theoretical perspective, reinforcement learning includes active learning, online learning (i.e., sequential decision making) and arbitrary loss functions. Furthermore, reinforcement learning is neither supervised nor unsupervised, but somewhere in between since there data provided is only correlated with the decisions. Depending on time, I will try to cover some of the important results from learning theory and Bayesian learning.

Wednesday, April 21, 2010, 2:30 p.m.
Title:	Fast Feature Selection for Gene Expression Data via Hilbert-Schmidt Independence Criterion.
Speaker:	Hadi Zarkoob
Abstract:	A novel feature selection technique for microarray gene expression data is proposed. It is based on the Hilbert-Schmidt independence criterion, and partly motivated by Rank-One Downdate (R1D) and the Singular Value Decomposition (SVD). The algorithm selects a small set of genes such that the response variable depends mainly on this subset, at the exclusion of the rest of the genes. The algorithm is computationally very fast and scalable to large data sets, and it does not require the number of important genes as an explicit input. Experimental results of the proposed technique are presented on some synthetic and well-known microarray data sets.