CIS 650 Project Options
I am generally amenable to most projects that are of reasonable scope. Many
projects have the option of being done individually or in groups of 2. I
believe that implementation projects are important because they allow us to
actually evaluate ideas. (Theory projects may also be acceptable, but they are
more hit-or-miss, so I generally advise against them.)
What is presented here is a sketch of project ideas. Your first
assignment will be to pick a topic, perhaps find a partner, and do a bit of
initial reading and thinking (and perhaps talking with me) about the
problem. This will be due Wednesday 2/5.
Resources for Finding More Information
If you're looking to do a bit of background reading, there are two very
important sites for finding research papers. From both, you can do searches by
author or by title.
Note that PostgreSQL and MySQL are open-source databases that have versions
for most platforms. The Tukwila system is not yet open-source, but I plan to
release it under an open-source license in the near future. SHORE and Berkeley
DB are storage systems that also have source code available.
Implementation Projects
Here are some suggested project topics. In all cases, you are expected to
not only implement something, but to demonstrate that it works on some
experimental workload:
- Building an adaptive XML database: The Tukwila data integration
system contains a query engine that is designed for remote data. Local data
poses an interesting set of challenges in which Tukwila's adaptive query
processing techniques may also be useful. The project here would be to add a
storage system to Tukwila and to support queries over it. Take the Tukwila
execution engine and add support for local storage, using an existing storage
package, e.g., SHORE or Berkeley DB. (Another alternative might be the storage
engine from PostgreSQL.) Choose an encoding for XML and write a simple
application to take an XML file and store it locally. Add a "scan" operator to
the Tukwila engine that retrieves data from this storage system.
- Two-person project: As above, but add support for multiple indices and
storage formats for the XML. Enhance the Tukwila optimizer and cost model
to take different access methods into account, and to choose algorithms and
access paths according to their relative costs.
Suggested readings: documentation for SHORE, Berkeley DB, Tukwila codebases;
papers on XML storage and indexing.
- SQL with ranked keyword search: Define a set of query language
extensions/operations to XQuery or SQL that support keywords. Build a
middleware layer over a relational database (Oracle, DB2, PostgreSQL) that
takes queries in this language, creates and utilizes inverted indices or other
structures, and makes use of these structures to answer queries. The
implementation should rank results using TF/IDF or some other common
ranked-results metric.
- Two person project: As above, but also design and build a web crawler for
the keyword engine, which indexes not only documents, but also meta-information
(source web site, document date, document type, etc.)
Suggested readings:
- The XQuery keyword search paper in your reader
- The Information Retrieval survey paper in your reader
- Chapter 2, Modern Information Retrieval (see me in person)
- Text extensions to SQL, e.g., those from SQL Server
- Optimization rules for XQuery Build an XQuery front-end that does
heuristic transformational optimizations and query "normalization." Make use
of XML Schema information wherever possible. (Example techniques include
propagating constants, removing redundant paths and variables, eliminating "*"
or "//" operators, etc.) Output the results either in "cleaner" XQuery or in
Tukwila's pre-processed plan format.
Suggested readings:
- Teaching an old DB new tricks Take an existing database system such
as PostgreSQL or mySQL and add a new indexing structure, e.g., bitmap indices
or join indices. Add the appropriate operators and cost estimates to the
system so it can make use of the new indices.
- Two-person variant: Create a "workload tuning wizard" that tries to find
the best combination of bitmapped and B-tree indices for a query workload.
Don't use exhaustive search --- use pruning and heuristics.
Suggested readings:
- Data placement (two-person) Assume you are given a cluster of
machines, each running your DBMS of choice, and two or more databases (e.g.,
TPC-H, TPC-C, XMark, etc.). Define a workload generator that creates queries
according to "common" patterns (perhaps inspired by those in the benchmark
DBs). Build a middleware layer that (1) materializes the results of queries at
various nodes in the system, and (2) makes use of previous results to answer
future queries, using Answering Queries Using Views algorithms. Compare
several policies for what data to materialize, what to evict, and where to
place it.
Suggested readings:
- Answering queries using views literature (start with the views survey)
- Voelker et al. paper on Global Memory System
- Operating systems textbooks -- chapter on page replacement
- ... Or propose your own idea!
For all implementation projects, a report (5-10 pages) describing the
problem you are addressing, your motivations, the approach and implementation,
and your contributions is required. Your paper should be structured much like
a conference paper, with a short abstract, introduction, body, related work,
and conclusion.
Survey and Analysis Paper Project
I am also willing to accept a survey and analysis paper as a term project.
This paper should be approximately 15 pages and should be a survey of a
research topic to which I agree, and it must analyze a minimum of 4 related
papers. One such project is the following:
- Study collaborative applications such as Groove, Lotus Notes/Domino;
collaborative web sites like Slashdot, Mozilla.org, and Sourceforge. Write a
10-15 page report comparing and contrasting the tools and identifying a set of
data management capabilities that lies at the core of these applications. Give
a presentation towards the end of the semester summarizing the results.
Advice on Writing
Zack Ives
Last modified: Fri Jan 31 11:25:08 EST 2003