Managing the Collaborative Sharing of Evolving Data

A prevalent problem today is the need to map data from one database to another -- where the databases may have different schemas and interfaces. Examples range from bibliographic citation databases to course grade sheets to the ACM Digital Library. Once data is mapped, it is frequently modified in multiple places at once, and the challenge lies in "synchronizing" or reconciling the modifications.

Project Overview

The ORCHESTRA project focuses on the challenges of such data sharing scenarios in the sciences -- specifically in bioinformatics. In this domain, there are a great many "standardized" databases with overlapping information, similar but not identical data, differing levels of data quality and confidence, and a variety of target audiences. In general, each database owner would like to store a "live" view of all relevant knowledge in its domain -- however, each site is independently extended, corrected, and analyzed. Moreover, individual biologists would like to download and maintain local "live snapshots" of data in order to run their own experiments. Unfortunately, there is often no consensus on what the best data is -- certain data items will always be disputed or revised. Our focus in the ORCHESTRA collaborative data sharing system (CDSS) is on how to support reconciliation across different schemas, among users who disagree. In general, each participant in the system specifies whom it trusts, and this is used to resolve conflicts locally.
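
To make the idea of trust-based local resolution concrete, here is a minimal Python sketch. It assumes a hypothetical model in which each participant gives every other peer an integer priority (0 or absent means untrusted) and, among conflicting values for the same data item, accepts the value from the highest-priority peer, deferring the decision when equally trusted peers disagree. The names Update and resolve and the priority scheme are illustrative assumptions, not the ORCHESTRA API.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Update:
    peer: str    # site that published the update (hypothetical field)
    key: str     # identifier of the data item being set
    value: str   # proposed value for that item


def resolve(updates, trust):
    """Pick one value per key using this participant's trust priorities.

    `trust` maps peer name -> integer priority; missing or 0 means untrusted.
    Returns (accepted, deferred): `accepted` maps key -> value, and
    `deferred` lists keys whose top-priority candidates disagree.
    """
    by_key = defaultdict(list)
    for u in updates:
        if trust.get(u.peer, 0) > 0:      # ignore updates from untrusted peers
            by_key[u.key].append(u)

    accepted, deferred = {}, []
    for key, candidates in by_key.items():
        best = max(trust[u.peer] for u in candidates)
        values = {u.value for u in candidates if trust[u.peer] == best}
        if len(values) == 1:
            accepted[key] = values.pop()
        else:
            deferred.append(key)          # equally trusted peers conflict
    return accepted, sorted(deferred)
```

Because the trust table is an argument, two participants can run the same function over the same updates and reach different local results, which is the point of resolving conflicts locally rather than globally.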


Basic Process

The figure to the right illustrates the basic functionality of ORCHESTRA. The system coordinates among a set of participating sites, each of which manages a database. Schema mappings describe how the data at these sites relates. Trust conditions specify which sites trust which data (and how much). The system allows all of the sites to be continuously updated, and on demand, it will propagate these updates across sites, according to the specified schema mappings and trust conditions.
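
The process just described can be sketched in a few lines of Python. This is a deliberately simplified illustration under stated assumptions: a schema mapping is modeled as a plain function that rewrites a row from the source site's schema into the target's, a trust condition is a predicate over the source site and row, and propagation covers a single hop only. The function propagate and all site, field, and parameter names are hypothetical, not taken from the ORCHESTRA code.

```python
def propagate(pending, mappings, trust_conditions):
    """Translate published updates along mappings, filtered by trust.

    pending:          {source_site: [row dicts]} updates awaiting propagation
    mappings:         list of (source_site, target_site, translate) triples,
                      where translate(row) returns the row in the target schema
    trust_conditions: {target_site: accept(source_site, row) -> bool};
                      a target without an entry accepts everything
    Returns {target_site: [translated rows]} for one hop of propagation.
    """
    delivered = {}
    for source, target, translate in mappings:
        accept = trust_conditions.get(target, lambda site, row: True)
        for row in pending.get(source, []):
            if accept(source, row):       # target's trust condition holds
                delivered.setdefault(target, []).append(translate(row))
    return delivered


if __name__ == "__main__":
    pending = {"siteA": [
        {"gene": "g1", "fn": "kinase", "conf": 0.9},
        {"gene": "g2", "fn": "ligase", "conf": 0.2},
    ]}
    # Mapping from siteA's schema (gene, fn) to siteB's schema (id, function).
    mappings = [("siteA", "siteB",
                 lambda r: {"id": r["gene"], "function": r["fn"]})]
    # siteB only trusts rows with confidence of at least 0.5.
    trust_conditions = {"siteB": lambda site, row: row["conf"] >= 0.5}
    print(propagate(pending, mappings, trust_conditions))
```

A real system would also have to follow chains of mappings and handle deletions and modifications; the sketch only shows how mappings and trust conditions combine for insertions.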

Research Topics

The ORCHESTRA project touches on a number of important database-related topics, including update translation across mappings or views; conditional information; peer-to-peer data sharing; data provenance; and more. This project takes our past work on the Piazza system one step further in supporting decentralization. See the list of publications below for further details.

System Implementation

ORCHESTRA uses a peer-to-peer implementation that requires a runtime on each machine with a database; additional computation and storage nodes can be hosted in the cloud.

We have recently released the source code of the first prototype ORCHESTRA system. We will continue to improve the distribution's flexibility and installation options. Currently we are happy to arrange for demonstrations and trial deployments here at Penn.

New: we gave a demonstration of the prototype ORCHESTRA system at SIGMOD 2007 and DILS 2007.

A video demonstration can be found here.

Here are some screen shots:

This is the main ORCHESTRA screen, showing a series of biological databases (ellipse nodes) and mappings among them (arcs with "Mx" labels). The PCBI PlasmoDB database has been highlighted.

This is the ORCHESTRA provenance viewer, which shows how a given data value (the tuple selected from the list on the right side of the screen) was produced. In this case, the tuple is highlighted graphically in green, and the arrows going into it represent the sources from which it was derived. This tuple was derived by Mapping M5, which combined three tuples; each of those was in turn a direct user insertion (the "+"s in the diamond vertices). In general, derivations can be significantly more complex.
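
A derivation like the one in this screenshot can be thought of as a graph that is walked backwards from a tuple to the user insertions it depends on. The Python sketch below illustrates that idea under simplifying assumptions: each derived tuple records which mapping produced it and from which source tuples, and any tuple with no recorded derivation is treated as a direct user insertion. The Derivation class, the base_insertions function, and the tuple identifiers are hypothetical and are not the viewer's actual data model.

```python
from dataclasses import dataclass


@dataclass
class Derivation:
    mapping: str     # label of the mapping that produced the tuple, e.g. "M5"
    sources: list    # ids of the tuples the mapping combined


def base_insertions(tuple_id, derivations):
    """Return the ids of the user-inserted tuples that `tuple_id` depends on.

    `derivations` maps a tuple id to the list of Derivation records that
    produce it; a tuple with no entry is assumed to be a direct insertion.
    """
    seen, stack, bases = set(), [tuple_id], set()
    while stack:
        current = stack.pop()
        if current in seen:               # guard against shared or cyclic paths
            continue
        seen.add(current)
        records = derivations.get(current)
        if not records:
            bases.add(current)            # no derivation: a user insertion
        else:
            for record in records:
                stack.extend(record.sources)
    return bases


if __name__ == "__main__":
    # One tuple derived by mapping M5 from three directly inserted tuples.
    derivations = {"t": [Derivation("M5", ["a", "b", "c"])]}
    print(sorted(base_insertions("t", derivations)))
```

The same traversal works unchanged when a source tuple is itself derived through another mapping, which is how more complex derivations would be followed.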

Related Publications

Grigoris Karvounarakis, Zachary G. Ives, Val Tannen. Querying Data Provenance. SIGMOD 2010.
Partha Pratim Talukdar, Zachary G. Ives, Fernando Pereira. Automatically Incorporating New Sources in Keyword Search-Based Data Integration. SIGMOD 2010.
Nicholas E. Taylor, Zachary G. Ives. Reliable Storage and Querying for Collaborative Data Sharing Systems. Full paper, ICDE 2010.
Todd J. Green, Zachary G. Ives, Val Tannen. Reconciling Differences. ICDT 2009.
Zachary G. Ives, Todd J. Green, Grigoris Karvounarakis, Nicholas E. Taylor, Val Tannen, Partha Pratim Talukdar, Marie Jacob, Fernando Pereira. The Orchestra Collaborative Data Sharing System. ACM SIGMOD Record, September 2008.
Partha Pratim Talukdar, Marie Jacob, M. Salman Mehmood, Koby Crammer, Zachary G. Ives, Fernando Pereira, Sudipto Guha. Learning to Create Data-Integrating Queries. VLDB 2008.
Grigoris Karvounarakis, Zachary Ives. Bidirectional Mappings for Data and Update Exchange. WebDB 2008.
Todd J. Green, Zachary G. Ives, Grigoris Karvounarakis, Val Tannen. Update Exchange with Mappings and Provenance. VLDB 2007.
Todd J. Green, Grigoris Karvounarakis, Nicholas E. Taylor, Olivier Biton, Zachary G. Ives, Val Tannen. ORCHESTRA: Facilitating Collaborative Data Sharing. Demonstration description, SIGMOD 2007.
Todd J. Green, Grigoris Karvounarakis, Val Tannen. Provenance Semirings. PODS 2007.
Nicholas Taylor, Zachary Ives. Reconciling Changes while Tolerating Disagreement in Collaborative Data Sharing. SIGMOD 2006, Chicago, IL.
Zachary G. Ives, Nitin Khandelwal, Aneesh Kapur, Murat Cakir. ORCHESTRA: Rapid, Collaborative Sharing of Dynamic Data. Conference on Innovative Data Systems Research (CIDR), Asilomar, CA, 2005.

Team Members

Prof. Zachary Ives
Prof. Val Tannen

Team Alumni

Olivier Biton
Murat Cakir
Charuta Joshi
Aneesh Kapur
Ivan Terziev
Mike Wittie
Nitin Khandelwal (first position: Oracle)
TJ Green (first position: UC Davis)
Grigoris Karvounarakis (first position: LogicBlox)
Nick Taylor (first position: Google)
Soeren Auer (first position: U. Leipzig)

Sponsorship

This research has been funded by NSF CAREER grant award #IIS-0477972, awarded to Zachary G. Ives at the University of Pennsylvania, and NSF SEIII grant #IIS-0513778.

Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.

Last modified: Wed Jul 18 12:03:16 EST 2007