## TGDA@OSU TRIPODS Center Workshop: Theory and Foundations of TGDA

### Organizers

Tamal K. Dey
CSE, The Ohio State University
Matthew Kahle
Mathematics, The Ohio State University
Sebastian Kurtek
Statistics, The Ohio State University
Facundo Memoli
Mathematics, The Ohio State University
David Sivakoff
Statistics and Mathematics, The Ohio State University
Yusu Wang
Computer Science and Engineering, The Ohio State University

Topological and geometric ideas have already shown promises in producing novel perspectives and powerful algorithms for analyzing complex and diverse data. As the theory and foundations of topological data analysis continue to mature, we are presented with great opportunities to consolidate existing synergy as well as to establish new connections and collaborations among computational scientists, mathematicians, and statisticians, so as to form new perspectives and develop novel methodologies / algorithms for modern data analysis. This workshop, organized by the TGDA group at OSU, presents a timely platform to help achieve these goals.

Current PhD students and young researchers within 5 years of the PhD degree are qualified to apply for partial travel / accommodation support to attend the TGDA workshop.

There is also a one week summer school program that preceeds the workshop.

Larry A. Wasserman, Statistics, CMU
Peter Bubenik, Mathematics, U. Florida

#### Travel and Lodging Info

• Lodging - There are several great hotel and lodging options available near the Ohio State University Campus. For a full list of options and more information go here. The MBI is located in Jennings Hall at 1735 Neil Avenue on the 3rd floor and most OSU hotels should offer shuttle transportation to get you to and from the MBI on campus for the workshop each day.
• Airport - When you arrive at John Glenn Columbus International airport-CMH you can take a taxi to your hotel (or find your hotel shuttle if offered) by going to the ground transportation area of the terminal where they offer 24-hour Airport taxi service.
• Driving to MBI & Campus Parking - If you are driving to the workshop, the closest public parking garage near the MBI is the 12th Avenue Garage. The MBI is just a short walk east from here on 12th Ave. to the Intersection of Neil Ave. where we are located in Jennings Hall on the 3rd floor. Here is a Google walking map.

We very gratefully acknowledge funding and support from our NSF TRIPODS grant, our NSF RTG grant, The Ohio State Math Research Institute MRI and the Ohio State Mathematical Biosciences Institute MBI.

### Accepted Speakers

Nina Amenta
Computer Science, University of California, Davis
Ulrich Bauer
Geometry & Visualization group, TU MÃ¼nchen
Paul Bendich
Mathematics, Duke University; Geometric Data Analytics, Inc.
Peter Bubenik
Mathematics, University of Florida
Chao Chen
Computer Science, City University of New York (CUNY)
Justin Eldridge
Computer Science and Engineering, The Ohio State University
Samory Kpotufe
ORFE, Princeton University
Lizhen Lin
Department of Applied and Computational Mathematics and Statistics,
Edgar Lobaton
Electrical and Computer Engineering, North Carolina State University
Mauro Maggioni
Mathematics, Johns Hopkins University
J. S. Marron
Statistics & OR, UNC
Ezra Miller
Mathematics, Duke University
David Mount
Computer Science, University of Maryland
Laxmi Parida
Computational Genomics, IBM THOMAS J. WATSON RESEARCH CENTER
Dejan Slepcev
Mathematical Sciences, Carnegie Mellon University
Mariel Vazquez
Mathematics, University of California Davis
Carola Wenk
Computer Science, Tulane University
Monday, May 21, 2018
Time Session
09:15 AM
09:30 AM

Opening remarks

09:30 AM
10:30 AM
Mariel Vazquez - Modeling local reconnection events along DNA molecules

In Escherichia coli DNA replication yields two interlinked circular chromosomes. Returning the chromosomes to an unlinked monomeric state is essential to cell survival. Simplification of DNA topology is mediated by enzymes, such as recombinases and topoisomerases. Recombinases act by a local reconnection event. We investigate analytically minimal pathways of unlinking by local reconnection. We introduce a Monte Carlo method to simulate local reconnection, provide a quantitative measure to distinguish among pathways and identify a most probable pathway. These results point to a universal property relevant to any local reconnection event between two sites along one or two circles.

10:30 AM
11:00 AM

Break

11:00 AM
12:00 PM
Ulrich Bauer - The Reeb Graph Edit Distance is Universal

We consider the setting of Reeb graphs of piecewise linear functions and study distances between them that are stable, meaning that functions which are similar in the supremum norm ought to have similar Reeb graphs. We define an edit distance for Reeb graphs and prove that it is stable and universal, meaning that it provides an upper bound to any other stable distance. In contrast, via a specific construction, we show that the interleaving distance and the functional distortion distance on Reeb graphs are not universal.

(Joint work with Claudia Landi and Facundo MÃ©moli)

12:00 PM
02:00 PM

Lunch Break

02:00 PM
03:00 PM
Paul Bendich - Topology and Geometry for Vehicle-tracking Systems

Many systems employ sensors to interpret the environment.

The target-tracking task is to gather sensor data from the environment and then to partition these data into tracks that are produced by the same target. A key challenge is to "connect-the-dots": more precisely, to take a sensor observation at a given time and associate it with a previously-existing track (or to declare that this is a new object). This can be very challenging, especially when there are multiple targets present, and when different sensors "go blind" at various times. There are a number of high-level designs for multi-target multi-sensor tracking systems, but many contemporary ones fit within the multiple hypothesis (MHT) paradigm. Typical MHTs formulate the 'connect-the-dots' problem as one of Bayesian inference, with competing multi-track hypotheses receiving scores.

This talk surveys three recent efforts to integrate topological and geometric information into the formula for computing hypothesis scores. The first uses zero-dimensional persistent homology summaries of kinematic information in car tracking, and the second uses persistent homology summaries in state space to form grouping hypotheses for nautical traffic. Finally, a method using self-similarity matrices is employed to make useful cross-modal comparisons in heterogeneous sensor networks. In all three efforts, large improvements in MHT performance are observed.

This talk is based on work funded by OSD, NRO, AFRL, and NSF, and it was done with many collaborators, including Nathan Borggren, Sang Chin, Jesse Clarke, Jonathan deSena, John Harer, Jay Hineman, Elizabeth Munch, Andrew Newman, Alex Pieloch, David Porter, David Rouse, Nate Strawn, Christopher J. Tralie, Adam Watkins, Michael Williams, and Peter Zulch.

03:00 PM
03:30 PM

Break

03:30 PM
04:30 PM
Laxmi Parida - Combinatorics, Topology and Genomics

Many problems in genomics are combinatorial in nature, i.e., the interrelationship amongst the entities is at par, if not more important, than the value of the entity themselves.

Graphs are the most commonly used mathematical object to model such relationships. However, often it is important to capture not just pairwise, but more general $k$-wise relationships. TDA provides a natural basis to model such interactions and, additionally, it provides mechanisms for extracting signal patterns from noisy data.

I will discuss two applications of this model, one to population genomics and the other to meta-genomics. In the former, we model the problem of detecting admixture in populations and in the latter we deal with the problem of detecting false positives in microbiome data.

The unifying theme is the use of topological methods -- in particular, persistent homology. I will explain the underlying mathematical models and describe the results obtained in both cases.

04:30 PM
07:00 PM

Poster session and reception

Tuesday, May 22, 2018
Time Session
09:30 AM
10:30 AM
Ezra Miller - Real multiparameter persistent homology

Persistent homology with multiple continuous parameters presents fundamental challenges different from those arising with one real or multiple discrete parameters: no existing algebraic theory applies (even poorly or inadequately). In part that is because the relevant modules are wildly infinitely generated. This talk explains how and why real multiparameter persistence should nonetheless be practical for data science applications. The key is a finiteness condition that encodes topological tameness -- which occurs in all modules arising from data -- robustly, in equivalent combinatorial and homological algebraic ways. Out of the tameness condition surprisingly falls much of ordinary (that is, noetherian) commutative algebra, crucially including finite minimal primary decomposition and a concept of minimal generator. The geometry and relevance of these algebraic notions will be explained from scratch, assuming no prior experience with commutative algebra, in the context of two genuine motivating applications: summarizing probability distributions and topology of fruit fly wing veins.

10:30 AM
11:00 AM

Break

11:00 AM
12:00 PM
David Mount - Approximating Convex Bodies and Applications

Recently, a series of seminal results has established the central role that convex approximation plays in solving a number of computational problems in Euclidean spaces of constant dimension. These include fundamental applications such as computing the diameter of a point set, Euclidean minimum spanning trees, bichromatic closest pairs, width kernels, and nearest neighbor searching. In each case, an approximation parameter eps > 0 is given, and the problem is to compute a solution that is within a factor of 1+eps of the exact solution.

In this talk, I will present recent approaches to convex approximation. These approaches achieve efficiency by being locally sensitive to the shape and curvature of the body. This is achieved through a combination of methods, both new and classical, including Delone sets in the Hilbert metric, Macbeath regions, and John ellipsoids. We will explore the development of these methods and discuss how they can be applied to the above problems.

12:00 PM
02:00 PM

Lunch Break

02:00 PM
03:00 PM
Lizhen Lin - Optimization on manifolds: parallel algorithms and Bayesian optimization techniques

The first part of this talk focuses on presenting some communication efficient algorithms for parallel optimization problems on manifolds. Convergence properties of such algorithms will be analyzed. The second part of the talk introduces Bayesian optimization techniques on manifolds. The methodology and computational algorithms will be discussed in details. Applications are considered for a large class of manifolds of interests.

03:00 PM
03:30 PM

Break

03:30 PM
03:30 PM
J. S. Marron - Object Oriented Data Analysis: Principal Nested Submanifolds

Object Oriented Data Analysis is the statistical analysis of populations of complex objects. This is seen to be particularly relevant in the Big Data era, where it is argued that an even larger scale challenge is Complex Data. Data objects with a geometric structure constitute a particularly active current research area. This is illustrated using a number of examples where data objects naturally lie in manifolds and spaces with a manifold stratification. An overview of the area is given, with careful attention to vectors of angles, i.e. data objects that naturally lie on torus spaces. Prinicpal Nested Submanifolds, which are generalizations of flags, are proposed to provide new analogs of Principal Component Analysis for data visualization. Open problems as to how to weight components in a simultaneous fitting procedure are discussed.

Wednesday, May 23, 2018
Time Session
09:30 AM
10:30 AM
Chao Chen - A Practical Solution for Homology Localization

Computing the optimal cycle of a given homology class is a challenging problem. Existing algorithms, although polynomial, are far from practical. We propose a new algorithm based on the A-star search. A carefully designed heuristic function is proposed to guide the search for the optimal solution. Our method is the first to be practical in large scale image analysis tasks. We apply the method to cardiac image analysis, in which we use persistent homology to detect delicate muscle structures and compute their optimal representations. We believe the method can be applied to many other topology data analysis contexts, where we want to annotate the persistence diagram with additional geometric information.

10:30 AM
11:00 AM

Break

11:00 AM
12:00 PM
Edgar Lobaton - From Cyborg-Insect to Human Motion: Topological Data Analysis Applied to Time Series

Topological Data Analysis has been applied to a number of problems in order to study the geometric and topological structure of point cloud data. In this talk, we discuss their application to the analysis of time series from cyborg-insect to human motion. These tools allow us to identify different type of motion behaviors from the agents, and help us discover the geometric structure of their surroundings. The mathematical framework developed allows us to derive bounds on the sampling of the data space in order to ensure recovery of the correct topological information, and also provide guarantees on the robustness of these quantities to perturbations on the data.

06:00 PM
08:00 PM

Banquet @ The Faculty Club

Thursday, May 24, 2018
Time Session
09:30 AM
10:30 AM
Samory Kpotufe - Modal-Set Estimation using kNN graphs, and Applications to Clustering

Estimating the mode or modal-sets (i.e. extrema points or surfaces) of an unknown density from sample is a basic problem in data analysis. Such estimation is relevant to other problems such as clustering, outlier detection, or can simply serve to identify low-dimensional structures in high dimensional-data (e.g. point-cloud data from medical-imaging, astronomy, etc). Theoretical work on mode-estimation has largely concentrated on understanding its statistical difficulty, while less attention has been given to implementable procedures. Thus, theoretical estimators, which are often statistically optimal, are for the most part hard to implement. Furthermore for more general modal-sets (general extrema of any dimension and shape) much less is known, although various existing procedures (e.g. for manifold-denoising or density-ridge estimation) have similar practical aim. Iâ€™ll present two related contributions of independent interest: (1) practical estimators of modal-sets â€“ based on particular subgraphs of a k-NN graph â€“ which attain minimax-optimal rates under surprisingly general distributional conditions; (2) high-probability finite sample rates for k-NN density estimation which is at the heart of our analysis. Finally, Iâ€™ll discuss recent successful work towards the deployment of these modal-sets estimators for various clustering applications.

Much of the talk is based on a series of work with collaborators S. Dasgupta, K. Chaudhuri, U. von Luxburg, and Heinrich Jiang.

10:30 AM
11:00 AM

Break

11:00 AM
12:00 PM
Carola Wenk - Applications of persistent homology to low-dimensional data

Persistent homology is commonly applied to study the shape of high dimensional data sets. In this talk we consider two low-dimensional data scenarios in which both geometry and topology are of importance. We study 2D and 3D images of prostate cancer tissue, in which glands form cycles or voids. Treatment decisions for prostate cancer patients are largely based on the pathological analysis of the geometry and topology of these data. As a second scenario we consider the comparison of road networks.

12:00 PM
02:00 PM

Lunch Break

02:00 PM
03:00 PM
Peter Bubenik - Topological spaces of persistence modules and their properties

Persistence modules are a central algebraic object arising in topological data analysis. Interleaving provides a natural way to measure distances between persistence modules. We consider various classes of persistence modules and the relationships between them. We also study the resulting topological spaces and their basic topological properties.

03:00 PM
03:30 PM

Break

03:30 PM
04:30 PM
Justin Eldridge - Clustering Correctly

Clustering is an important machine learning task whose goal is to identify the natural groups (or "clusters") in data. Given a data set, what is the correct clustering? There is no single answer to this seemingly simple question. In a statistical setting, however, where it is assumed that the data are sampled from an underlying probability distribution, it is natural to define the clusters of the distribution itself. We then say that a clustering method is "correct" if its output converges in some sense to the ideal clustering of the distribution as the size of the data grows. In this talk, I discuss the correctness of clustering methods in two settings: first, when the data are points sampled from a probability density, and second, when the data are graphs generated from a graphon -- a powerful non-parametric random graph model of much recent interest. In each case, I identify the natural hierarchical cluster structure of the distribution, formalize a strong notion of convergence to the tree of the ideal clustering, and construct efficient clustering methods which converge -- and are therefore "correct".

Friday, May 25, 2018
Time Session
09:30 AM
10:30 AM
Mauro Maggioni - Multiscale Methods for Dictionary Learning, Regression and Optimal Transport for data near low-dimensional sets

We discuss a family of ideas, algorithms, and results for analyzing various new and classical problems in the analysis of high-dimensional data sets. These methods we discuss perform well when data is (nearly) intrinsically low-dimensional. They rely on the idea of performing suitable multiscale geometric decompositions of the data, and exploiting such decompositions to perform a variety of tasks in signal processing and statistical learning. In particular, we discuss the problem of dictionary learning, where one is interested in constructing, given a training set of signals, a set of vectors (dictionary) such that the signals admit a sparse representation in terms of the dictionary vectors. We then discuss the problem of regressing a function on a low-dimensional unknown manifold. For both problems we introduce a multiscale estimator, fast algorithms for constructing it, and give finite sample guarantees for its performance, and discuss its optimality. Finally, we discuss an application of these multiscale decompositions to the fast calculation of optimal transportation plans, introduce a multiscale version of optimal transportation distances, and discuss preliminary applications. These are joint works with S. Gerber, W. Liao, S. Vigogna.

10:30 AM
11:00 AM

Break

11:00 AM
12:00 PM
Dejan Slepcev - Asymptotics of objective functionals in semi-supervised learning

We consider a family of regression problems in a semi-supervised setting. Given real-valued labels on a small subset of data the task is to recover the function on the whole data set while taking advantage of the (geometric) structure provided by the large number of unlabeled data points. We consider a random geometric graph to represent the geometry of the data set. We study objective functions which reward the regularity of the estimator function and impose or reward the agreement with the training data. In particular we consider discrete p-Laplacian and fractional Laplacian regularizations.

We investigate asymptotic behavior in the limit where the number of unlabeled points increases while the number of training points remains fixed. We uncover a delicate interplay between the regularizing nature of the functionals considered and the nonlocality inherent to the graph constructions. We rigorously obtain almost optimal ranges on the scaling of the graph connectivity radius for the asymptotic consistency to hold. The talk is based on joint works with Matthew Dunlop, Andrew Stuart, and Matthew Thorpe.

Name Email Affiliation
Abdelrazek, Ahmed akader@cs.umd.edu Compute Science, University of Maryland
Amenta, Nina amenta@cs.ucdavis.edu Computer Science, University of California, Davis
Asiaee T., Amir asiaeetaheri.1@mbi.osu.edu Mathematical Biosciences Institute, The Ohio State University
Bauer, Ulrich mail@ulrich-bauer.org Geometry & Visualization group, TU MÃ¼nchen
Bendich, Paul bendich@math.duke.edu Mathematics, Duke University; Geometric Data Analytics, Inc.
Brunson, Jason jabrunso@vt.edu Center for Quantitative Medicine, UConn Health
Bubenik, Peter peter.bubenik@gmail.com Mathematics, University of Florida
Bungula, Wako wako-bungula@uiowa.edu Mathematics, University of Iowa
Cai, Chen chencai.math@gmail.com CSE, Ohio State University
Chen, Chao Computer Science, City University of New York (CUNY)
Chowdhury, Samir chowdhury.57@osu.edu Mathematics, The Ohio State University
Ciocanel, Veronica ciocanel.1@mbi.osu.edu Mathematical Biosciences Institute, The Ohio State University
Contreras Peruyero, Adriana Haydee haydeeperuyero@gmail.com Mathematics, National Autonomous University of Mexico (UNAM)
Dey, Tamal K. dey.8@osu.edu CSE, The Ohio State University
Edwards, Parker pedwards@ufl.edu Mathematics, University of Florida
Elchesen, Alex a.elchesen@ufl.edu Mathematics, University of Florida
Eldridge, Justin eldridge.48@osu.edu Computer Science and Engineering, The Ohio State University
Filip, Ioan ioan.filip@gmail.com Systems Biology, Columbia University
Flores Velazco, Alejandro afloresv@cs.umd.edu Computer Science, University of Maryland
Gakhar, Hitesh hitesh.gakhar@gmail.com Mathematics, Michigan State University
Hamilton, Wesley wham@live.unc.edu Mathematics, University of North Carolina, Chapel Hill
Hang, Haibin haibin.hang312@gmail.com Mathematics, Florida State University
Honaker, John honaker.32@osu.edu Statistics, The Ohio State University
Ignacio, Paul paulsamuel-ignacio@uiowa.edu Mathematics, University of Iowa
Jin, Kun jin.810@osu.edu computer science and engineering, Ohio State University
Kahle, Matthew kahle.70@osu.edu Mathematics, The Ohio State University
Kim, Woojin kim.5235@osu.edu Mathematics, The Ohio State University
Kpotufe, Samory samory@princeton.edu ORFE, Princeton University
Kurtek, Sebastian kurtek.1@stat.osu.edu Statistics, The Ohio State University
Lee, Kang-Ju leekj0706@snu.ac.kr Department of Mathematical Sciences, Seoul National University
Li, Binglin blnligeometry@gmail.com Statistics, University of Georgia
Lim, Sunhyuk lim.991@osu.edu Mathematics, The Ohio State University
Lin, Lizhen lizhen.lin@nd.edu Department of Applied and Computational Mathematics and Statistics,
Lobaton, Edgar edgar.lobaton@ncsu.edu Electrical and Computer Engineering, North Carolina State University
Luo, Hengrui luo.619@osu.edu Department of Statistics, The Ohio State University
Maggioni, Mauro mauro.maggioni@jhu.edu Mathematics, Johns Hopkins University
Mandal, Sayan mandal.25@buckeyemail.osu.edu Computer Science, The Ohio State University
Marron, J. S. marron@email.unc.edu Statistics & OR, UNC
McGee, Reginald mcgee.278@mbi.osu.edu Mathematical Biosciences Institute, The Ohio State University
Memoli, Facundo memoli@math.osu.edu Mathematics, The Ohio State University
Mike, Joshua mikejosh@msu.edu Computational Mathematics, Science, and Engineering, Michigan State University
Miller, Ezra ezra@math.duke.edu Mathematics, Duke University
Mirth, Joshua mirth@math.colostate.edu Mathematics, Colorado State University
Moon, Chul chulmoon@uga.edu Statistics, University of Georgia
Mount, David mount@cs.umd.edu Computer Science, University of Maryland
Needham, Tom needham.71@osu.edu Mathematics, The Ohio State University
Okutan, Osman okutan.1@osu.edu Mathematics, The Ohio State University
Osborne, Matthew osborne.334@osu.edu Mathematics, The Ohio State University
Palande, Sourabh sourabh@sci.utah.edu School of Computing, University of Utah
Parida, Laxmi parida@us.ibm.com Computational Genomics, IBM THOMAS J. WATSON RESEARCH CENTER
Patel, Dhruv drewbeep@ufl.edu Mathematics, University of Florida
Percival, Sarah sperciva@purdue.edu Mathematics, Purdue University
Piekenbrock, Matthew piekenbrock.5@wright.edu Computer Science and Engineering, Wright State University
Polanco, Luis polanco2@msu.edu Computational Mathematics, Science and Engineering, Michigan State University
Ryu, Hyunnam hr03863@uga.edu Statistics, University of Georgia
Saha, Abhijoy saha.58@osu.edu Statistics, The Ohio State University
Scoville, Nicholas nscoville@ursinus.edu Mathematics and Computer Science, Ursinus College
Singhal, Kritika singhal.53@osu.edu Mathematics, The Ohio State University
Sivakoff, David sivakoff.2@osu.edu Statistics and Mathematics, The Ohio State University
Slepcev, Dejan slepcev@math.cmu.edu Mathematical Sciences, Carnegie Mellon University
Solomon, Yitzchak ysolomon@math.brown.edu Mathematics, Brown University
Stefanou, Anastasios astefanou@albany.edu Mathematics and Statistics, University at Albany (SUNY)
Strait, Justin strait.50@osu.edu Statistics, The Ohio State University
Su, Ming Yi su.672@osu.edu Computer Science and Engineering, Ohio State University
Thomas, Ashleigh athomas@math.duke.edu Math, Duke University
Tu, Junyi junyitu@gmail.com Computer Science and Engineering, USF
Tymochko, Sarah tymochko@msu.edu Computational Mathematics, Science and Engineering, Michigan State University
Vazquez, Mariel mariel@math.ucdavis.edu Mathematics, University of California Davis
Vipond, Oliver vipond@maths.ox.ac.uk Mathematics Department, University of Oxford
Wagner, Alexander wagnera@ufl.edu Mathematics, University of Florida
Wan, Zhengchao wan.252@osu.edu Mathematics, The Ohio State University
Wang, Yusu yusu@cse.ohio-state.edu Computer Science and Engineering, The Ohio State University
Wang, Jiayuan wangjiayuan.zju@gmail.com Department of Computer Science and Engineering, Ohio State University
Wenk, Carola cwenk@tulane.edu Computer Science, Tulane University
Yap, XiuHuan yap.4@wright.edu DSO National Laboratories, Wright State University
Zhang, Tong zhang.5031@osu.edu Mathematics, Ohio State University
The Reeb Graph Edit Distance is Universal

We consider the setting of Reeb graphs of piecewise linear functions and study distances between them that are stable, meaning that functions which are similar in the supremum norm ought to have similar Reeb graphs. We define an edit distance for Reeb graphs and prove that it is stable and universal, meaning that it provides an upper bound to any other stable distance. In contrast, via a specific construction, we show that the interleaving distance and the functional distortion distance on Reeb graphs are not universal.

(Joint work with Claudia Landi and Facundo MÃƒÂ©moli)

Topology and Geometry for Vehicle-tracking Systems

Many systems employ sensors to interpret the environment.

The target-tracking task is to gather sensor data from the environment and then to partition these data into tracks that are produced by the same target. A key challenge is to "connect-the-dots": more precisely, to take a sensor observation at a given time and associate it with a previously-existing track (or to declare that this is a new object). This can be very challenging, especially when there are multiple targets present, and when different sensors "go blind" at various times. There are a number of high-level designs for multi-target multi-sensor tracking systems, but many contemporary ones fit within the multiple hypothesis (MHT) paradigm. Typical MHTs formulate the 'connect-the-dots' problem as one of Bayesian inference, with competing multi-track hypotheses receiving scores.

This talk surveys three recent efforts to integrate topological and geometric information into the formula for computing hypothesis scores. The first uses zero-dimensional persistent homology summaries of kinematic information in car tracking, and the second uses persistent homology summaries in state space to form grouping hypotheses for nautical traffic. Finally, a method using self-similarity matrices is employed to make useful cross-modal comparisons in heterogeneous sensor networks. In all three efforts, large improvements in MHT performance are observed.

This talk is based on work funded by OSD, NRO, AFRL, and NSF, and it was done with many collaborators, including Nathan Borggren, Sang Chin, Jesse Clarke, Jonathan deSena, John Harer, Jay Hineman, Elizabeth Munch, Andrew Newman, Alex Pieloch, David Porter, David Rouse, Nate Strawn, Christopher J. Tralie, Adam Watkins, Michael Williams, and Peter Zulch.

Topological spaces of persistence modules and their properties

Persistence modules are a central algebraic object arising in topological data analysis. Interleaving provides a natural way to measure distances between persistence modules. We consider various classes of persistence modules and the relationships between them. We also study the resulting topological spaces and their basic topological properties.

A Practical Solution for Homology Localization

Computing the optimal cycle of a given homology class is a challenging problem. Existing algorithms, although polynomial, are far from practical. We propose a new algorithm based on the A-star search. A carefully designed heuristic function is proposed to guide the search for the optimal solution. Our method is the first to be practical in large scale image analysis tasks. We apply the method to cardiac image analysis, in which we use persistent homology to detect delicate muscle structures and compute their optimal representations. We believe the method can be applied to many other topology data analysis contexts, where we want to annotate the persistence diagram with additional geometric information.

Persistence Landscapes are Graded Persistence Diagrams
Clustering Correctly

Clustering is an important machine learning task whose goal is to identify the natural groups (or "clusters") in data. Given a data set, what is the correct clustering? There is no single answer to this seemingly simple question. In a statistical setting, however, where it is assumed that the data are sampled from an underlying probability distribution, it is natural to define the clusters of the distribution itself. We then say that a clustering method is "correct" if its output converges in some sense to the ideal clustering of the distribution as the size of the data grows. In this talk, I discuss the correctness of clustering methods in two settings: first, when the data are points sampled from a probability density, and second, when the data are graphs generated from a graphon -- a powerful non-parametric random graph model of much recent interest. In each case, I identify the natural hierarchical cluster structure of the distribution, formalize a strong notion of convergence to the tree of the ideal clustering, and construct efficient clustering methods which converge -- and are therefore "correct".

Modal-Set Estimation using kNN graphs, and Applications to Clustering

Estimating the mode or modal-sets (i.e. extrema points or surfaces) of an unknown density from sample is a basic problem in data analysis. Such estimation is relevant to other problems such as clustering, outlier detection, or can simply serve to identify low-dimensional structures in high dimensional-data (e.g. point-cloud data from medical-imaging, astronomy, etc). Theoretical work on mode-estimation has largely concentrated on understanding its statistical difficulty, while less attention has been given to implementable procedures. Thus, theoretical estimators, which are often statistically optimal, are for the most part hard to implement. Furthermore for more general modal-sets (general extrema of any dimension and shape) much less is known, although various existing procedures (e.g. for manifold-denoising or density-ridge estimation) have similar practical aim. IÃ¢â‚¬â„¢ll present two related contributions of independent interest: (1) practical estimators of modal-sets Ã¢â‚¬â€œ based on particular subgraphs of a k-NN graph Ã¢â‚¬â€œ which attain minimax-optimal rates under surprisingly general distributional conditions; (2) high-probability finite sample rates for k-NN density estimation which is at the heart of our analysis. Finally, IÃ¢â‚¬â„¢ll discuss recent successful work towards the deployment of these modal-sets estimators for various clustering applications.

Much of the talk is based on a series of work with collaborators S. Dasgupta, K. Chaudhuri, U. von Luxburg, and Heinrich Jiang.

Optimization on manifolds: parallel algorithms and Bayesian optimization techniques

The first part of this talk focuses on presenting some communication efficient algorithms for parallel optimization problems on manifolds. Convergence properties of such algorithms will be analyzed. The second part of the talk introduces Bayesian optimization techniques on manifolds. The methodology and computational algorithms will be discussed in details. Applications are considered for a large class of manifolds of interests.

From Cyborg-Insect to Human Motion: Topological Data Analysis Applied to Time Series

Topological Data Analysis has been applied to a number of problems in order to study the geometric and topological structure of point cloud data. In this talk, we discuss their application to the analysis of time series from cyborg-insect to human motion. These tools allow us to identify different type of motion behaviors from the agents, and help us discover the geometric structure of their surroundings. The mathematical framework developed allows us to derive bounds on the sampling of the data space in order to ensure recovery of the correct topological information, and also provide guarantees on the robustness of these quantities to perturbations on the data.

Multiscale Methods for Dictionary Learning, Regression and Optimal Transport for data near low-dimensional sets

We discuss a family of ideas, algorithms, and results for analyzing various new and classical problems in the analysis of high-dimensional data sets. These methods we discuss perform well when data is (nearly) intrinsically low-dimensional. They rely on the idea of performing suitable multiscale geometric decompositions of the data, and exploiting such decompositions to perform a variety of tasks in signal processing and statistical learning. In particular, we discuss the problem of dictionary learning, where one is interested in constructing, given a training set of signals, a set of vectors (dictionary) such that the signals admit a sparse representation in terms of the dictionary vectors. We then discuss the problem of regressing a function on a low-dimensional unknown manifold. For both problems we introduce a multiscale estimator, fast algorithms for constructing it, and give finite sample guarantees for its performance, and discuss its optimality. Finally, we discuss an application of these multiscale decompositions to the fast calculation of optimal transportation plans, introduce a multiscale version of optimal transportation distances, and discuss preliminary applications. These are joint works with S. Gerber, W. Liao, S. Vigogna.

Object Oriented Data Analysis: Principal Nested Submanifolds

Object Oriented Data Analysis is the statistical analysis of populations of complex objects. This is seen to be particularly relevant in the Big Data era, where it is argued that an even larger scale challenge is Complex Data. Data objects with a geometric structure constitute a particularly active current research area. This is illustrated using a number of examples where data objects naturally lie in manifolds and spaces with a manifold stratification. An overview of the area is given, with careful attention to vectors of angles, i.e. data objects that naturally lie on torus spaces. Prinicpal Nested Submanifolds, which are generalizations of flags, are proposed to provide new analogs of Principal Component Analysis for data visualization. Open problems as to how to weight components in a simultaneous fitting procedure are discussed.

Real multiparameter persistent homology

Persistent homology with multiple continuous parameters presents fundamental challenges different from those arising with one real or multiple discrete parameters: no existing algebraic theory applies (even poorly or inadequately). In part that is because the relevant modules are wildly infinitely generated. This talk explains how and why real multiparameter persistence should nonetheless be practical for data science applications. The key is a finiteness condition that encodes topological tameness -- which occurs in all modules arising from data -- robustly, in equivalent combinatorial and homological algebraic ways. Out of the tameness condition surprisingly falls much of ordinary (that is, noetherian) commutative algebra, crucially including finite minimal primary decomposition and a concept of minimal generator. The geometry and relevance of these algebraic notions will be explained from scratch, assuming no prior experience with commutative algebra, in the context of two genuine motivating applications: summarizing probability distributions and topology of fruit fly wing veins.

Approximating Convex Bodies and Applications

Recently, a series of seminal results has established the central role that convex approximation plays in solving a number of computational problems in Euclidean spaces of constant dimension. These include fundamental applications such as computing the diameter of a point set, Euclidean minimum spanning trees, bichromatic closest pairs, width kernels, and nearest neighbor searching. In each case, an approximation parameter eps > 0 is given, and the problem is to compute a solution that is within a factor of 1+eps of the exact solution.

In this talk, I will present recent approaches to convex approximation. These approaches achieve efficiency by being locally sensitive to the shape and curvature of the body. This is achieved through a combination of methods, both new and classical, including Delone sets in the Hilbert metric, Macbeath regions, and John ellipsoids. We will explore the development of these methods and discuss how they can be applied to the above problems.

Combinatorics, Topology and Genomics

Many problems in genomics are combinatorial in nature, i.e., the interrelationship amongst the entities is at par, if not more important, than the value of the entity themselves.

Graphs are the most commonly used mathematical object to model such relationships. However, often it is important to capture not just pairwise, but more general $k$-wise relationships. TDA provides a natural basis to model such interactions and, additionally, it provides mechanisms for extracting signal patterns from noisy data.

I will discuss two applications of this model, one to population genomics and the other to meta-genomics. In the former, we model the problem of detecting admixture in populations and in the latter we deal with the problem of detecting false positives in microbiome data.

The unifying theme is the use of topological methods -- in particular, persistent homology. I will explain the underlying mathematical models and describe the results obtained in both cases.

A Geometric Variational Approach to Bayesian Inference

We propose a novel Riemannian geometric framework for variational inference in Bayesian models based on the nonparametric Fisher-Rao metric on the manifold of probability density functions. Under the square-root density representation, the manifold can be identified with the positive orthant of the unit hypersphere S^∞ in L^2, and the Fisher-Rao metric reduces to the standard L2 metric. Exploiting such a Riemannian structure, we formulate the task of approximating the posterior distribution as a variational problem on the hypersphere based on the &â€Œalpha;-divergence. This provides a tighter lower bound on the marginal distribution when compared to, and a corresponding upper bound unavailable with, approaches based on the Kullback-Leibler divergence. We propose a novel gradient-based algorithm for the variational problem based on Frechet derivative operators motivated by the geometry of S^∞, and examine its properties. Through simulations and real-data applications, we demonstrate the utility of the proposed geometric framework and algorithm on several Bayesian models.

Asymptotics of objective functionals in semi-supervised learning

We consider a family of regression problems in a semi-supervised setting. Given real-valued labels on a small subset of data the task is to recover the function on the whole data set while taking advantage of the (geometric) structure provided by the large number of unlabeled data points. We consider a random geometric graph to represent the geometry of the data set. We study objective functions which reward the regularity of the estimator function and impose or reward the agreement with the training data. In particular we consider discrete p-Laplacian and fractional Laplacian regularizations.

We investigate asymptotic behavior in the limit where the number of unlabeled points increases while the number of training points remains fixed. We uncover a delicate interplay between the regularizing nature of the functionals considered and the nonlocality inherent to the graph constructions. We rigorously obtain almost optimal ranges on the scaling of the graph connectivity radius for the asymptotic consistency to hold. The talk is based on joint works with Matthew Dunlop, Andrew Stuart, and Matthew Thorpe.

Automatic Detection and Uncertainty Quantification of Landmarks on Elastic Curves

A population quantity of interest in statistical shape analysis is the location of landmarks, which are points that aid in reconstructing and representing shapes of objects. We provide an automated, model-based approach to inferring landmarks given a sample of shape data. The model is formulated based on a linear reconstruction of the shape, passing through the specified points, and a Bayesian inferential approach is described for estimating unknown landmark locations. The question of how many landmarks to select is addressed in two different ways: (1) by defining a criterion-based approach, and (2) joint estimation of the number of landmarks along with their locations. Efficient methods for posterior sampling are also discussed. We motivate our approach using several simulated examples, as well as data obtained from applications in computer vision and biology; additionally, we explore placements and associated uncertainty in landmarks for various substructures extracted from magnetic resonance image slices.

Modeling local reconnection events along DNA molecules

In Escherichia coli DNA replication yields two interlinked circular chromosomes. Returning the chromosomes to an unlinked monomeric state is essential to cell survival. Simplification of DNA topology is mediated by enzymes, such as recombinases and topoisomerases. Recombinases act by a local reconnection event. We investigate analytically minimal pathways of unlinking by local reconnection. We introduce a Monte Carlo method to simulate local reconnection, provide a quantitative measure to distinguish among pathways and identify a most probable pathway. These results point to a universal property relevant to any local reconnection event between two sites along one or two circles.

Applications of persistent homology to low-dimensional data

Persistent homology is commonly applied to study the shape of high dimensional data sets. In this talk we consider two low-dimensional data scenarios in which both geometry and topology are of importance. We study 2D and 3D images of prostate cancer tissue, in which glands form cycles or voids. Treatment decisions for prostate cancer patients are largely based on the pathological analysis of the geometry and topology of these data. As a second scenario we consider the comparison of road networks.

### Posters

Modified Python Mapper

Mapper algorithm uses filter functions and the pre-image of the range of filter functions to obtain overlapping bins. In the modified python mapper version, I will introduce an alternate algorithm that avoids the use of filter functions to obtain overlapping bins. Though the benefit of using filter functions is undeniable, loss of information is inevitable in the process of mapping dataset to a parameter space and pulling it back to the data space. Thus, avoiding the filter function gives an alternative (maybe better) algorithm with minimal loss of information compared to the current python mapper algorithm.

Using TDA tools to study ring channel dynamics in the worm reproductive system

Contractile rings are structures of actin filaments that are important in egg cell formation, wound healing, and cell division. For example, the ring channels are essential in allowing passage of developing egg cells through the body of the worm C. elegans. To maintain the ring structure, these channels are regulated by the force of motor proteins such as myosins. We use complex agent-based model simulations to understand how different types of myosin motors regulate the filament architecture and maintain the contractile rings at a constant diameter. We analyze our preliminary simulation outputs using topological data analysis tools. Specifically, we calculate the first two Betti numbers for connected components and topological circles and visualize them over simulation time and topological persistence scale. Our preliminary work suggests that these TDA tools may be useful in investigating mechanistic differences in simulations of the motor-actin interactions.

The reflection distance between zigzag persistence modules

We use the reflection functors introduced by Bernstein, Gelfand, and Ponomarev in 1973 to define a metric on the space of all zigzag modules of a given length, which we call the reflection distance. Zigzag modules are compared by performing sequences of diamond transformations via reflections. We show that the reflection distance between two given zigzag modules of the same length is an upper bound for the Wasserstein distance between their respective persistence diagrams.

Persistent K&Atilde;&frac14;nneth Formula

We define a filtered product space X x Y of filtered topological spaces X and Y and give a relationship between the barcodes of X x Y in terms of barcodes of X and barcodes of Y.

Persistent Terrace for Topological Inference of Point Cloud Data

Topological data analysis (TDA) is a rapidly developing collection of methods for studying the shape of point cloud and other data types. One popular approach, designed to be robust to noise and outliers, is to first use a smoothing function to convert the point cloud into a manifold and then apply persistent homology to a Morse filtration. A significant challenge is that this smoothing process involves the choice of a parameter and persistent homology is highly sensitive to that choice; moreover, important scale information is lost. We propose a novel topological summary plot, called a persistence terrace, which incorporates a wide range of smoothing parameters similar to a scale-space analysis and is robust, multi-scale, and parameter-free. This plot allows one to isolate distinct topological signals that may have merged for any fixed value of the smoothing parameter, and it also allows one to infer the size and point density of the topological features.

Persistence and Curvature

We use Topological Data Analysis to estimate curvature of samples from model spaces of constant curvature. We connect birth and death of a 2-simplex with equal edge lengths to the curvature of the 2-simplex. We validate our intuition of the connection between persistence and curvature by constructing a statistical model using persistence landscapes, a sequence of functions derived from the persistence diagram. The discretized persistence landscape functions were used to construct a support vector regression model to estimate curvature.

On The Reeb Spaces of Definable Maps

Given a topological space X and a continuous function f : X ! R, define an equivalence relation ~ on X by setting x1 ~ x2 if f(x1) = f(x2) and x1; x2 are in the same connected component of f^-1(f(x1)) = f^-1(f(x2)). The space X= ~ is called the Reeb graph of f, denoted R(f). The concept of the Reeb graph was introduced by Georges Reeb as a tool in Morse theory. The notion of the Reeb graph can be generalized to the Reeb space by letting f : X ! Y , where Y is any topological space, and is an important object in applied topology. In our work, we study Reeb spaces of semi-algebraic maps. We observe that unlike in the case of Reeb graphs, the Betti numbers of the Reeb space of a map f : X ! Y could be arbitrarily large compared to those of X. The main result of our work is a singly exponential upper bound on the Betti numbers of the Reeb space in terms of the degrees of the polynomials defining the map f, and the sets X and Y.

A Barcode Embedding for Metric Graphs

Stable topological invariants are a cornerstone of persistence theory and applied topology, but their discriminative properties are often poorly-understood. In a 2015 paper, Dey, Wang, and Shi studied a rich homology-based invariant of metric graphs consisting of a family of persistence diagrams. Contrasting with their work, which mainly explored the stability and computability of this invariant, we explored its discriminativity, culminating in some precise inverse results that demonstrate its generic injectivity.

Theory of Interleavings on Categories with a Flow

The interleaving distance was originally defined in the field of Topological Data Analysis (TDA) by Chazal et al. as a metric on the class of persistence modules parametrized over the real line. Bubenik et al. subsequently extended the definition to categories of functors on a poset, the objects in these categories being regarded as generalized persistence modules'. These metrics typically depend on the choice of a lax semigroup of endomorphisms of the poset.

The purpose of this project is to develop a more general framework for the notion of interleaving distance using the theory of monoidal actions on categories'. Specifically, we extend the notion of interleaving distance to arbitrary categories equipped with a lax monoidal action by the monoid $[0,infty)$, which we call `categories with a flow'.

In this way, the class of objects in such a category acquires the structure of a Lawvere metric space. Functors that are colax $[0,infty)$-equivariant yield maps that are $1$-Lipschitz. This leads to concise proofs of various known stability results from TDA, by considering appropriate colax $[0,infty)$-equivariant functors. Along the way, we show that several common metrics, including the Hausdorff distance, the $L^{infty}$ distance on $R^n$ and extended $L^infty$ distance on $M$-spaces (where $M$ is any metric space), can be realized as interleaving distances in this general perspective.

Multirank: a multiparameter persistence invariant that captures birth and death

This poster introduces a novel multiparameter persistence invariant -- the multitrank function -- that captures geometric information about how homology classes move through a filtration, including how classes persist, merge, and branch. Multiranks are ranks of homomorphisms between direct sums of homology groups for finitely many topological subspaces, the usual rank function being the case where the source and target direct sums have one summand each. The multirank admits dual expressions for the births and deaths of a persistence module. A computationally friendly version of multirank, called m-rank, restricts the numbers of summands to be at most m; it determines all births and deaths when m is sufficiently large.

Using Persistent Homology to Quantify a Diurnal Cycle in Hurricane Felix

The tropical cyclone (TC) diurnal cycle is a regular, daily cycle in hurricanes that may have implications for the structure and intensity of hurricanes. This pattern can be seen in a cold ring forming in the inner core of the storm near sunset and propagating away from the storm center overnight, followed by warmer clouds on its inside edge. The current state of the art for diurnal cycle measurement has a limited ability to analyze the behavior beyond qualitative observations. Our method creates a more advanced mathematical method for quantifying the TC diurnal cycle using tools from Topological Data Analysis, specifically one dimensional persistent homology. Using geostationary operational environmental satellite (GOES) IR imagery data from Hurricane Felix in 2007, our method is able to detect an approximately daily cycle in the hurricane.

Multiparameter Persistence Landscapes

Persistent homology has proved a useful data analytic tool, and there have been various approaches to producing invariants amenable to machine learning and statistical analysis. One such summary that has seen wide ranging applications is the persistence landscape. The theory of multiparameter persistence modules is unlike the single parameter case where we may associate a barcode to a module, there is not an analogous complete discrete invariant in the multiparameter setting. We propose an incomplete stable invariant derived from the rank invariant the multiparameter persistence landscape. Our invariant is a Lipschitz function in Lebesgue space. As such it is naturally endowed with a distance function and is well suited to statistical analysis. Moreover the landscape has a natural kernel and may be discretised and fed into machine learning algorithms.

Stabilizing the unstable output of persistent homology computations

The persistent homology algorithm is generally seen as a procedure which starts with a filtered complex and ends with a persistence diagram. This procedure is stable (at least to certain types of perturbations of the input), justifying the use of the diagram as a signature of the input and the use of features derived from it in machine learning. However, these computations also produce other potentially useful but unstable information. We recast these problems as real-valued functions which are discontinuous but measurable, and then observe that convolving such a function with a suitable function produces a Lipschitz function. The resulting stable function can be estimated by perturbing the input and averaging the output. This is joint work with Paul Bendich and Peter Bubenik.

Persistent homology is commonly applied to study the shape of high dimensional data sets. In this talk we consider two low-dimensional data scenarios in which both geometry and topology are of importance. We study 2D and 3D images of prostat

Computing the optimal cycle of a given homology class is a challenging problem. Existing algorithms, although polynomial, are far from practical. We propose a new algorithm based on the A-star search. A carefully designed heuristic function

Persistence modules are a central algebraic object arising in topological data analysis. Interleaving provides a natural way to measure distances between persistence modules. We consider various classes of persistence modules and the relatio

Estimating the mode or modal-sets (i.e. extrema points or surfaces) of an unknown density from sample is a basic problem in data analysis. Such estimation is relevant to other problems such as clustering, outlier detection, or can simply ser

Topological Data Analysis has been applied to a number of problems in order to study the geometric and topological structure of point cloud data. In this talk, we discuss their application to the analysis of time series from cyborg-insect to

Real multiparameter persistent homology
Ezra Miller

Persistent homology with multiple continuous parameters presents fundamental challenges different from those arising with one real or multiple discrete parameters: no existing algebraic theory applies (even poorly or inadequately). In part t

Object Oriented Data Analysis is the statistical analysis of populations of complex objects. This is seen to be particularly relevant in the Big Data era, where it is argued that an even larger scale challenge is Complex Data. Data objects w

Approximating Convex Bodies and Applications
David Mount

Recently, a series of seminal results has established the central role that convex approximation plays in solving a number of computational problems in Euclidean spaces of constant dimension. These include fundamental applications such as co

Combinatorics, Topology and Genomics
Laxmi Parida

Many problems in genomics are combinatorial in nature, i.e., the interrelationship amongst the entities is at par, if not more important, than the value of the entity themselves.

Graphs are the most commonly used ma

Many systems employ sensors to interpret the environment.

The target-tracking task is to gather sensor data from the environment and then to partition these data into tracks that are produced by the same target. A k

The Reeb Graph Edit Distance is Universal
Ulrich Bauer

We consider the setting of Reeb graphs of piecewise linear functions and study distances between them that are stable, meaning that functions which are similar in the supremum norm ought to have similar Reeb graphs. We define an edit distanc

Modeling local reconnection events along DNA molecules
Mariel Vazquez

In Escherichia coli DNA replication yields two interlinked circular chromosomes. Returning the chromosomes to an unlinked monomeric state is essential to cell survival. Simplification of DNA topology is mediated by enzymes, such as recombina

Videos

### Print

Full Schedule Participant List