UC Santa Cruz will join three other institutions to establish a transdisciplinary research institute bringing together mathematicians, statisticians, and theoretical computer scientists to develop the theoretical foundations of the fast-growing field of data science.

Funded by the National Science Foundation (NSF), the Institute for Foundations of Data Science (IFDS) is a collaboration between the University of Washington, UC Santa Cruz, University of Wisconsin-Madison, and University of Chicago. Its mission is to develop a principled approach to the analysis of the ever-larger, more complex, and potentially biased data sets that play an increasingly important role in industry, government, and academia. The institute’s research will lead to methods that are more computationally efficient, robust to errors and incomplete or ambiguous data, and better able to respond and act in changing environments.

Support for the IFDS comes from a $12.5 million grant from the NSF’s Transdisciplinary Research in Principles of Data Science (TRIPODS) program. It is one of two institutes nationwide receiving the first TRIPODS Phase II awards, building on earlier Phase I efforts that began in 2017 at UC Santa Cruz and 11 other universities.

UCSC researchers led by Lise Getoor, professor of computer science and engineering in the Baskin School of Engineering, have been focused on developing a theory of data science applied to uncertain and heterogeneous graph and network data. In the new institute, they will build on their Phase I successes, with a new emphasis on the ethical and societal implications of data-driven algorithms.

“We want to be sure that while we’re looking at optimization methods in data science, we connect that with issues of fairness, bias, and privacy,” said Getoor, who will be teaching her course on Ethics and Algorithms for the second time this fall.

“Any kind of algorithmic decision making can end up being biased with respect to race, gender, and so on,” she said. “One of our concerns is the issue of feedback loops by which these biased decisions get reinforced and can cascade in unanticipated and undesirable ways.”

Statistician Abel Rodriguez added that the data science community needs to do a better job of explaining how to interpret the output of machine learning algorithms.

“People need to be more skeptical. There is a tendency to interpret the output as indicating causality, when in fact the result is just an association that has nothing to do with the actual process you are investigating,” said Rodriguez, who will be based at the University of Washington while continuing his UCSC affiliation as a visiting professor and serving as the diversity liaison for the new institute.

Privacy is also a continuing challenge in data science. In 2018, UCSC received an NSF TRIPODS+X grant to investigate ways to protect the privacy of individuals while allowing access to large genomic data sets, a project led by Abhradeep Guha Thakurta, assistant professor of computer science and engineering.

The IFDS will build on strong collaborative relationships that already exist between the member institutions, Getoor said. The five-year funding plan for the institute includes $2.3 million for UC Santa Cruz. In addition to its research agenda, the institute will engage the data science community through workshops, summer schools, and hackathons.

At UC Santa Cruz, the IFDS will support education and outreach activities through the Everett Program, the Center for Public Philosophy, Girls Who Code, and the local chapter of the Association for Computing Machinery – Women. Planned activities include the Everett Program’s “impactathons” in support of local nonprofits and the Center for Public Philosophy’s Ethics Bowl programs.

In addition to Getoor and Rodriguez, the lead researchers at UC Santa Cruz include C. “Sesh” Seshadhri, associate professor of computer science and engineering, and Daniele Venturi, associate professor of applied mathematics. The TRIPODS effort at UCSC includes faculty in the Departments of Computer Science and Engineering, Statistics, and Applied Mathematics, and also has close ties with the D3 Data Science Research Center.

TRIPODS is closely tied to the NSF’s Harnessing the Data Revolution (HDR) program, which aims to accelerate discovery and innovation in data science algorithms, data cyberinfrastructure, and education and workforce development.

NSF Division Director for Mathematical Sciences Juan Meza said in a statement, “With NSF’s $25 million investment, these interdisciplinary teams will be able to tackle some of the most important theoretical and technical questions in data science.”