ORCHESTRA
Managing the Collaborative Sharing of Evolving Data
A prevalent problem today is the need to map data
from one database to another, where the databases may have
different schemas and interfaces. Examples range from
bibliographic citation databases to course grade sheets to the ACM Digital
Library. Once data is mapped, it is frequently modified in multiple
places at once, and the challenge lies in "synchronizing" or
reconciling the modifications.
Project Overview
The ORCHESTRA project focuses on the challenges
of such data sharing scenarios in the sciences -- specifically addressing
the challenges in bioinformatics. In this domain, there are a great many
"standardized" databases with overlapping information, similar but not
identical data, differing levels of data quality/confidence, and a
variety of different target audiences. In general, each database owner
would like to store a "live" view of all relevant knowledge in its domain --
however, each site is being independently extended, corrected, and
analyzed. Moreover, individual biologists would like to be able to
download and maintain local "live snapshots" of data in order to run
their own experiments. Unfortunately, there is often no consensus on
what the best data is -- certain data items will always be disputed or
revised. Our focus in the ORCHESTRA collaborative data sharing
system (CDSS) is on how to support reconciliation across different
schemas and among users who disagree. In general, each participant in the
system specifies whom it trusts, and these trust specifications are used to
resolve conflicts locally.
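As a rough illustration of trust-based local reconciliation, the sketch below lets one participant resolve conflicting updates using its own trust settings. It is not ORCHESTRA's actual algorithm or API: the function name `reconcile`, the integer trust priorities, and the rule of deferring ties are all assumptions made for this example.

```python
from collections import defaultdict

def reconcile(updates, trust):
    """Resolve conflicting updates from one participant's point of view.

    updates: iterable of (origin_site, key, value) triples.
    trust:   dict mapping origin_site -> integer priority (hypothetical
             encoding; sites that are missing or have priority 0 are untrusted).
    Returns (accepted, deferred): accepted maps key -> value, deferred maps
    key -> sorted list of equally trusted, conflicting values.
    """
    candidates_by_key = defaultdict(list)
    for origin, key, value in updates:
        priority = trust.get(origin, 0)
        if priority > 0:  # ignore updates from untrusted sites
            candidates_by_key[key].append((priority, value))

    accepted, deferred = {}, {}
    for key, candidates in candidates_by_key.items():
        top = max(priority for priority, _ in candidates)
        top_values = {value for priority, value in candidates if priority == top}
        if len(top_values) == 1:
            accepted[key] = top_values.pop()   # a single most-trusted value wins
        else:
            deferred[key] = sorted(top_values)  # tie: leave for the user to decide
    return accepted, deferred

# Illustrative data only; the site names and values are made up.
updates = [
    ("siteA", "gene1", "kinase"),
    ("siteB", "gene1", "phosphatase"),
    ("siteB", "gene2", "transporter"),
    ("siteC", "gene2", "receptor"),
    ("siteD", "gene3", "ligase"),
]
trust = {"siteA": 2, "siteB": 1, "siteC": 1}
accepted, deferred = reconcile(updates, trust)
```

Because every participant supplies its own `trust` table, two sites can see the same updates and still settle on different local instances, which is the kind of tolerated disagreement described above.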
Basic Process
The figure to the right illustrates the basic functionality of ORCHESTRA. The system coordinates among a set of participating
sites, each of which manages a database. Schema mappings describe
how the data at these sites relates. Trust conditions specify which
sites trust which data (and how much). The system allows all of the sites
to be continuously updated, and on demand, it will propagate these updates
across sites, according to the specified schema mappings and trust.
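The process described above can be sketched in a few lines. This is a simplified model under stated assumptions, not the ORCHESTRA implementation. Mappings are modeled as plain Python functions, trust conditions as predicates on incoming tuples, and only insertions are handled. The names `propagate`, `lab`, and `archive` are hypothetical.

```python
def propagate(databases, mappings, trust_conditions):
    """Propagate inserted tuples across sites until nothing new is accepted.

    databases:        dict site -> set of tuples (the local instance).
    mappings:         list of (source_site, target_site, translate), where
                      translate(row) returns a tuple in the target schema,
                      or None if the mapping does not apply to that row.
    trust_conditions: dict target_site -> predicate(source_site, row).
                      A site without an entry accepts everything.
    Assumes the mappings generate only finitely many tuples; otherwise the
    loop would not terminate.
    """
    changed = True
    while changed:
        changed = False
        for source, target, translate in mappings:
            accepts = trust_conditions.get(target, lambda site, row: True)
            for row in list(databases[source]):  # copy: sets are mutated below
                mapped = translate(row)
                if mapped is None or mapped in databases[target]:
                    continue
                if accepts(source, mapped):
                    databases[target].add(mapped)
                    changed = True
    return databases

# Illustrative data: "lab" stores (id, function, confidence) and
# "archive" stores (function, id, confidence).
databases = {
    "lab": {("P12345", "kinase", 0.9), ("Q99999", "unknown", 0.2)},
    "archive": set(),
}
mappings = [("lab", "archive", lambda r: (r[1], r[0], r[2]))]
trust_conditions = {
    # archive trusts lab only for tuples with confidence of at least 0.5
    "archive": lambda site, row: site == "lab" and row[2] >= 0.5,
}
propagate(databases, mappings, trust_conditions)
```

After the call, `archive` holds only the translated high-confidence tuple, while `lab` keeps both of its own tuples. Each site remains free to edit its local data between propagation rounds.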
Research Topics
The ORCHESTRA project touches on a number of important
database-related topics, including update translation across mappings or
views; conditional information; peer-to-peer data sharing; data provenance;
and more.
This project takes our past work on the Piazza system one step further
in supporting decentralization. See the list of publications below
for further details.
System Implementation
ORCHESTRA uses a peer-to-peer implementation that
requires a runtime on each machine with a database; additional computation and storage nodes can be hosted in the cloud.
We have recently released the source code of the first prototype
ORCHESTRA system. We will continue to improve
the distribution's flexibility and installation options. Currently we are
happy to arrange for demonstrations and trial deployments here at Penn.
New: we gave a demonstration of the prototype
ORCHESTRA system at SIGMOD 2007 and DILS 2007.
A video demonstration can be found here.
Here are some screen shots:
This is the main ORCHESTRA screen, showing a series
of biological databases (ellipse nodes) and mappings among them (arcs with
"Mx" labels). The PCBI PlasmoDB database has been highlighted.
This is the ORCHESTRA provenance viewer, which
shows how a given data value (the tuple selected from the list on the right
side of the screen) was produced. In this case, the tuple is highlighted
graphically in green, and the arrows going into it represent sources from
which it was derived. This tuple was derived from Mapping M5, which combined
three tuples, which were in turn direct user insertions (the
"+"s in the diamond vertices). In general, derivations can be significantly
more complex.
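The derivation just described can be modeled as a small graph. The sketch below is illustrative only and does not reflect ORCHESTRA's internal provenance representation. The `Derivation` class, the tuple ids `t1` to `t4`, and the use of "+" to mark a direct user insertion (mirroring the diamond vertices in the viewer) are assumptions for this example.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Derivation:
    mapping: str         # mapping label such as "M5"; "+" marks a user insertion
    sources: tuple = ()  # ids of the tuples the mapping combined

def base_insertions(tuple_id, provenance, seen=None):
    """Return the ids of all directly inserted tuples that tuple_id derives from."""
    seen = set() if seen is None else seen
    if tuple_id in seen:  # guard against cycles and repeated visits
        return set()
    seen.add(tuple_id)
    result = set()
    for derivation in provenance.get(tuple_id, []):
        if derivation.mapping == "+":
            result.add(tuple_id)
        for source_id in derivation.sources:
            result |= base_insertions(source_id, provenance, seen)
    return result

# The example from the text: one tuple derived by mapping M5 from three
# tuples that users inserted directly.
provenance = {
    "t4": [Derivation("M5", ("t1", "t2", "t3"))],
    "t1": [Derivation("+")],
    "t2": [Derivation("+")],
    "t3": [Derivation("+")],
}
roots = base_insertions("t4", provenance)
```

A tuple may carry several entries in `provenance`, one per alternative derivation, which is how the more complex cases mentioned above would be represented in this model.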
Related Publications
- Grigoris Karvounarakis, Zachary G. Ives, Val Tannen. Querying Data Provenance. SIGMOD 2010.
- Partha Pratim Talukdar, Zachary G. Ives, Fernando Pereira. Automatically Incorporating New Sources in Keyword Search-Based Data Integration. SIGMOD 2010.
- Nicholas E. Taylor, Zachary G. Ives. Reliable Storage and Querying for Collaborative Data Sharing Systems. Full paper, ICDE 2010.
- Todd J. Green, Zachary G. Ives, and Val Tannen. Reconciling Differences. ICDT 2009.
- Zachary G. Ives, Todd J. Green, Grigoris Karvounarakis, Nicholas
E. Taylor, Val Tannen, Partha Pratim Talukdar, Marie Jacob, Fernando
Pereira. The Orchestra Collaborative Data Sharing System.
ACM SIGMOD Record, September 2008.
- Partha Pratim Talukdar, Marie Jacob, M. Salman Mehmood, Koby Crammer,
Zachary G. Ives, Fernando Pereira, and Sudipto Guha. Learning to Create
Data-Integrating Queries. VLDB 2008.
- Grigoris Karvounarakis and Zachary Ives. Bidirectional Mappings for
Data and Update Exchange. WebDB 2008.
- Todd J. Green, Zachary G. Ives, Grigoris Karvounarakis, Val
Tannen. Update Exchange with Mappings and Provenance. VLDB 2007.
- Todd J. Green, Grigoris Karvounarakis, Nicholas E. Taylor, Olivier Biton,
Zachary G. Ives, Val Tannen. ORCHESTRA: Facilitating Collaborative Data
Sharing. Demonstration description, SIGMOD 2007.
- Todd J. Green, Grigoris Karvounarakis, Val Tannen. Provenance Semirings.
PODS 2007.
- Nicholas Taylor, Zachary Ives. Reconciling Changes while Tolerating
Disagreement in Collaborative Data Sharing. SIGMOD 2006,
Chicago, IL.
- Zachary G. Ives, Nitin Khandelwal, Aneesh Kapur, Murat Cakir.
ORCHESTRA: Rapid, Collaborative Sharing of
Dynamic Data. Conference on Innovative
Data Systems Research (CIDR), Asilomar, CA, 2005.
Team Members
- Prof. Zachary Ives
- Prof. Val Tannen
Team Alumni
- Olivier Biton
- Murat Cakir
- Charuta Joshi
- Aneesh Kapur
- Ivan Terziev
- Mike Wittie
- Nitin Khandelwal (first position: Oracle)
- TJ Green (first position: UC Davis)
- Grigoris Karvounarakis (first position: LogicBlox)
- Nick Taylor (first position: Google)
- Soeren Auer (first position: U. Leipzig)
Sponsorship
This research has been funded by NSF CAREER grant award #IIS-0477972,
awarded to Zachary G. Ives at the University of Pennsylvania, and NSF SEIII
grant #IIS-0513778.
Any opinions, findings, and conclusions or recommendations expressed in this
material are those of the author(s) and do not necessarily reflect the views of
the National Science Foundation.
Last modified: Wed Jul 18 12:03:16 EST 2007