Research
I am broadly interested in the intersection of data management and machine learning. My long-term goal is to enhance the robustness, usability, and usefulness of database management systems (DBMS) across diverse real-world scenarios and applications. Here are several themes I am working on:
- Robust query-driven learned database components:
Learned database components address several weaknesses of traditional query optimizers. However, they rely heavily on the queries they were trained on and can behave unpredictably on out-of-distribution (OOD) test queries. To mitigate this issue, we proposed ShiftHandler [SIGMOD '24], a framework that models shifting workloads and uses a replay buffer to retrain learned database components rapidly and effectively.
More recently, we proposed incorporating domain knowledge [arXiv '24] to enhance the robustness of learned databases. Additionally, we introduced a practical theory of generalization [Under Revision '25] for query-driven selectivity learning, which offers a theoretical understanding of how such models generalize to OOD queries.
- Making DBMS usable and useful under strict privacy constraints:
Improving DBMS performance (for example, through query optimization and index recommendation) and benchmarking DBMS products both require access to data. However, in real production environments such as cloud DBMS services, accessing user data is often not allowed due to privacy concerns. To address this, we developed SAM [SIGMOD '22], a tool that generates data using a supervised autoregressive model trained on query workloads. This allows us to benchmark DBMS products without directly accessing the data. More recently, we developed a query-driven cardinality estimation framework [In Submission '25] that works with imperfect query workloads. This framework can be used to support index recommendations without looking at the data.
- Systems for data exploration and research:
We develop data systems that facilitate various kinds of data exploration and research. Working with collaborators from Meta and AWS, we designed and implemented QuoteInspector [VLDB '24 Demo], a system for monitoring, querying, and inspecting the flow of social discourse. Previously, I contributed to the dbCAN [Nucleic Acids Research '18] project by designing and implementing its core search functionality on genomic data, a key component of the system.
Earlier, I worked on machine learning for cardinality estimation ([SIGMOD 2021], [VLDB 2022], [EDBT 2022]).