Logical Data Expiration
solutions: whenever large amounts of data are repeatedly collected
over a period of time, it is essential to have a clear approach to
identifying the parts of the data that are no longer needed and a
policy that allows disposing of and/or archiving these parts. Such
policies are necessary even if adding storage to accommodate an
ever-growing collection of data were possible, since the growing
amount of data must be examined during querying, which in turn
degrades query performance over time.
The approaches to data expiration range from ad hoc administrative
policies or regulations to sophisticated data-analysis-based
techniques. The approaches have, however, one thing in common:
intuitively, they try to identify the parts of the data collection
that will not be needed in the future. The key to deciding if a piece of
information will be needed in the future lies in identifying what
queries can be asked over the collection of data and how the
collection can evolve from its current state. The various techniques
proposed in the literature differ in the way they identify the parts
of the data that are no longer needed.
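To make the intuition concrete, the following is a minimal sketch and
not a method taken from the literature surveyed here. It assumes a
hypothetical append-only history of sensor readings with non-decreasing
timestamps, and assumes that the only query ever asked is "latest value
per sensor". Under those assumptions, every reading superseded by a
newer reading from the same sensor can be expired, because it cannot
affect the answer now or after any further appends. All names
(Reading, latest_per_sensor, expire) are illustrative.

```python
from typing import Dict, List, Tuple

# Hypothetical record layout: (timestamp, sensor, value).
Reading = Tuple[int, str, float]


def latest_per_sensor(history: List[Reading]) -> Dict[str, float]:
    """The single query assumed to be asked over the collection."""
    latest: Dict[str, Reading] = {}
    for reading in history:
        ts, sensor, _ = reading
        # On equal timestamps the reading appearing later in the list wins.
        if sensor not in latest or ts >= latest[sensor][0]:
            latest[sensor] = reading
    return {sensor: r[2] for sensor, r in latest.items()}


def expire(history: List[Reading]) -> List[Reading]:
    """Keep only readings that can still influence the query answer.

    Assumes the history only grows by appending readings whose
    timestamps are not smaller than any timestamp already stored.
    """
    latest: Dict[str, Reading] = {}
    for reading in history:
        ts, sensor, _ = reading
        if sensor not in latest or ts >= latest[sensor][0]:
            latest[sensor] = reading
    # At most one reading per sensor survives.
    return sorted(latest.values())


history: List[Reading] = [
    (1, "a", 10.0),
    (2, "b", 20.0),
    (3, "a", 11.0),
    (4, "a", 12.0),
]
residue = expire(history)
```

In this sketch the retained residue is bounded by the number of distinct
sensors rather than by the length of the history. With a richer query
language or a different model of how the collection evolves, far less
data may be safely removable.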
This talk formalizes the notion of data expiration in terms of
how the data is used to answer queries. We survey existing
approaches to the problem in a unified framework and discuss their
features and limits, as well as the limits of data-expiration-based
techniques in general. The particular focus of the chapter is on
comparing the space performance of various data expiration methods.
Interestingly, the methods developed for data expiration are
almost directly applicable to processing standing queries over data
streams and to the construction of synopses.