CIS 650 Project Options

I am generally amenable to most projects that are of reasonable scope. Many projects have the option of being done individually or in groups of 2. I believe that implementation projects are important because they allow us to actually evaluate ideas. (Theory projects may also be acceptable, but they are more hit-or-miss, so I generally advise against them.)

What is presented here is a sketch of project ideas. Your first assignment will be to pick a topic, perhaps find a partner, and do a bit of initial reading and thinking (and perhaps talking with me) about the problem. This will be due Wednesday 2/5.

Resources for Finding More Information

If you're looking to do a bit of background reading, there are two very important sites for finding research papers. From both, you can do searches by author or by title.

Note that PostgreSQL and MySQL are open-source databases that have versions for most platforms. The Tukwila system is not yet open-source, but I plan to release it under an open-source license in the near future. SHORE and Berkeley DB are storage systems that also have source code available.

Implementation Projects

Here are some suggested project topics. In all cases, you are expected to not only implement something, but to demonstrate that it works on some experimental workload:

Building an adaptive XML database: The Tukwila data integration system contains a query engine that is designed for remote data. Local data poses an interesting set of challenges in which Tukwila's adaptive query processing techniques may also be useful. The project here would be to add a storage system to Tukwila and to support queries over it. Take the Tukwila execution engine and add support for local storage, using an existing storage package, e.g., SHORE or Berkeley DB. (Another alternative might be the storage engine from PostgreSQL.) Choose an encoding for XML and write a simple application to take an XML file and store it locally. Add a "scan" operator to the Tukwila engine that retrieves data from this storage system.

Two-person project: As above, but add support for multiple indices and storage formats for the XML. Enhance the Tukwila optimizer and cost model to take different access methods into account, and to choose algorithms and access paths according to their relative costs.

Suggested readings: documentation for SHORE, Berkeley DB, Tukwila codebases; papers on XML storage and indexing.

SHORE documentation or Berkeley DB documentation
Tukwila source code overview and documentation
Shanmugasundaram et al. paper on XML storage, VLDB 99

SQL with ranked keyword search: Define a set of query language extensions/operations to XQuery or SQL that support keywords. Build a middleware layer over a relational database (Oracle, DB2, PostgreSQL) that takes queries in this language, creates and utilizes inverted indices or other structures, and makes use of these structures to answer queries. The implementation should rank results using TF/IDF or some other common ranked-results metric.
- Two person project: As above, but also design and build a web crawler for the keyword engine, which indexes not only documents, but also meta-information (source web site, document date, document type, etc.)
Suggested readings:
- The XQuery keyword search paper in your reader
- The Information Retrieval survey paper in your reader
- Chapter 2, Modern Information Retrieval (see me in person)
- Text extensions to SQL, e.g., those from SQL Server
Optimization rules for XQuery Build an XQuery front-end that does heuristic transformational optimizations and query "normalization." Make use of XML Schema information wherever possible. (Example techniques include propagating constants, removing redundant paths and variables, eliminating "*" or "//" operators, etc.) Output the results either in "cleaner" XQuery or in Tukwila's pre-processed plan format.
Suggested readings:
- XQuery language specification
- Manolescu et al. paper on XQuery to SQL translation, VLDB 2002.
Teaching an old DB new tricks Take an existing database system such as PostgreSQL or mySQL and add a new indexing structure, e.g., bitmap indices or join indices. Add the appropriate operators and cost estimates to the system so it can make use of the new indices.
- Two-person variant: Create a "workload tuning wizard" that tries to find the best combination of bitmapped and B-tree indices for a query workload. Don't use exhaustive search --- use pruning and heuristics.
Suggested readings:
- Papers on bitmap indexing, including Chan and Ioannidis, SIGMOD 98
- PostgreSQL documentation
- Chaudhuri and Narassaya paper on SQL Server index selection, VLDB 97 (for two person variant)
Data placement (two-person) Assume you are given a cluster of machines, each running your DBMS of choice, and two or more databases (e.g., TPC-H, TPC-C, XMark, etc.). Define a workload generator that creates queries according to "common" patterns (perhaps inspired by those in the benchmark DBs). Build a middleware layer that (1) materializes the results of queries at various nodes in the system, and (2) makes use of previous results to answer future queries, using Answering Queries Using Views algorithms. Compare several policies for what data to materialize, what to evict, and where to place it.
Suggested readings:
- Answering queries using views literature (start with the views survey)
- Voelker et al. paper on Global Memory System
- Operating systems textbooks -- chapter on page replacement
... Or propose your own idea!

For all implementation projects, a report (5-10 pages) describing the problem you are addressing, your motivations, the approach and implementation, and your contributions is required. Your paper should be structured much like a conference paper, with a short abstract, introduction, body, related work, and conclusion.

Survey and Analysis Paper Project

I am also willing to accept a survey and analysis paper as a term project. This paper should be approximately 15 pages and should be a survey of a research topic to which I agree, and it must analyze a minimum of 4 related papers. One such project is the following:

Study collaborative applications such as Groove, Lotus Notes/Domino; collaborative web sites like Slashdot, Mozilla.org, and Sourceforge. Write a 10-15 page report comparing and contrasting the tools and identifying a set of data management capabilities that lies at the core of these applications. Give a presentation towards the end of the semester summarizing the results.

Advice on Writing

Advice on writing and presentation collected by Mark Leone. Especially note "How to give a good research talk" and "How to present a paper in theoretical computer science" (which is also relevant for non-theory papers).
Research resources collected by Renee Miller.
Written and oral presentation skills collected by Stephen Scott

Zack Ives

Last modified: Fri Jan 31 11:25:08 EST 2003