- Gates & Hillman Centers
- ATREYEE MAITI
- Master's Student
- Department of Electrical and Computer Engineering
- Carnegie Mellon University
Intelligent Data Clustering for Cold Data Storage in Main Memory Database Systems
Transactions in on-line transaction processing (OLTP) workloads typically have the following characteristics: (1) they are short-lived, (2) they work on a small subset of the data, and (3) they are repetitive. Traditional disk-based database management systems (DBMS) incur too much overhead for OLTP datasets that could simply be memory resident, because of their heavyweight concurrency control and recovery mechanisms. Given these OLTP application characteristics, the poor performance of traditional disk-based systems, and the large amount of main memory available in machines today, main memory databases have emerged as the right choice for OLTP applications.
However, these systems are limited to datasets that fit in main memory. It is possible to overcome this constraint by using techniques such as anti-caching and OS paging. These techniques rely on identifying cold data in main memory databases and pushing it out to secondary storage devices such as a solid-state disk (SSD) or hard disk drive (HDD), in order to make space for new data in memory. When a transaction accesses evicted data, the DBMS retrieves it and merges it back into main memory storage. The overhead of this retrieval can be significant if the transaction needs data from multiple tables and that data is stored in different locations on disk. These techniques fail to leverage the inherent hierarchical nature of OLTP databases. The correlation among tables and the probability of correlated tuples being accessed together can be used to guide both the eviction of cold data and its later retrieval. This research explores the use of data clustering to group related tuples together, reducing the number of disk operations when using out-of-memory secondary storage in a high-performance OLTP DBMS. The technique stores tuples that are likely to be needed together in a single block during eviction. As a result, they are fetched in a single disk read instead of multiple disk reads, which reduces latency and increases throughput.
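The effect of clustering on the number of block reads can be illustrated with a toy model. This is only a minimal sketch, not the method used in this research: the `customer`/`orders` schema, the `root` field that links each tuple to its parent customer, the block size of three tuples, and the helper names `pack_blocks` and `blocks_read` are all assumptions made for illustration.

```python
BLOCK_SIZE = 3  # hypothetical: number of tuples per evicted block


def pack_blocks(tuples, key_fn, block_size=BLOCK_SIZE):
    """Order cold tuples by key_fn and pack them into fixed-size blocks.

    Returns a dict mapping tuple id -> block number on secondary storage.
    """
    ordered = sorted(tuples, key=key_fn)
    return {t["id"]: i // block_size for i, t in enumerate(ordered)}


def blocks_read(placement, needed_ids):
    """Count the distinct blocks that must be fetched for the given tuples."""
    return len({placement[tid] for tid in needed_ids})


# Toy cold data (hypothetical schema): each customer owns two orders.
cold = []
for c in range(4):
    cold.append({"id": f"customer:{c}", "table": "customer", "root": c})
    for o in range(2):
        cold.append({"id": f"order:{c}:{o}", "table": "orders", "root": c})

# Baseline: each table is evicted on its own, so related tuples are scattered.
per_table = pack_blocks(cold, key_fn=lambda t: (t["table"], t["id"]))

# Clustered: tuples sharing the same root customer are evicted side by side.
clustered = pack_blocks(cold, key_fn=lambda t: (t["root"], t["table"], t["id"]))

# A transaction that touches one customer and both of that customer's orders.
needed = ["customer:2", "order:2:0", "order:2:1"]

print("per-table block reads:", blocks_read(per_table, needed))
print("clustered block reads:", blocks_read(clustered, needed))
```

In this toy setup the clustered layout serves the transaction from one block, while the per-table layout touches three. A real system would also have to decide which tuples are cold and how to estimate which tuples are accessed together, neither of which is modeled here.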