While a great deal of work goes into making databases fast enough to process the wealth of data collected today, it remains useful to sample in order to obtain a quick estimate of a result. The problem is that you often have no idea (beyond intuition) how accurate that estimate is. Error bounds have been worked out for specific sampling schemes, but not in any general way. This paper sets out to solve that. Its main contributions are the GUS (Generalized Uniform Sampling) operator, which can describe a broad class of sampling methods, and Second Order Analytic (SOA) equivalence, which says that two query plans are equivalent in terms of expected value and variance. Combining these with the supported relational operators gives a way to reason about and calculate error bounds for nearly any type of sampling.
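To make the flavor of this concrete, here is a minimal sketch of the simplest case the paper generalizes: Bernoulli sampling of a single table, with an unbiased (Horvitz-Thompson style) estimate of a SUM and an unbiased estimate of that estimator's variance. This is my own toy illustration, not the GUS operator or anything from the paper; the function name and setup are assumptions for demonstration. The paper's point is that this kind of expected-value and variance bookkeeping, which is easy for one table, can be carried compositionally through joins, selections, and nested sampling.

```python
import math
import random

def bernoulli_sample_sum(values, p, seed=0):
    """Estimate SUM(values) from a Bernoulli sample with inclusion
    probability p, along with the standard error of that estimate.

    Toy single-table case only; hypothetical helper, not the paper's
    GUS operator.
    """
    rng = random.Random(seed)
    # Bernoulli sampling: each tuple survives independently with prob. p.
    sample = [x for x in values if rng.random() < p]

    # Unbiased estimate of the sum: scale the sampled sum by 1/p.
    estimate = sum(sample) / p

    # Unbiased variance estimate: each sampled x contributes
    # x^2 * (1 - p) / p^2, whose expectation is the true variance
    # of the scaled-sum estimator.
    variance = sum(x * x for x in sample) * (1 - p) / (p * p)

    return estimate, math.sqrt(variance)

values = list(range(1, 10_001))   # true SUM = 50,005,000
est, stderr = bernoulli_sample_sum(values, p=0.1)
print(f"estimate = {est:,.0f} +/- {2 * stderr:,.0f} (approx. 95% CI)")
```

Deriving the variance term above takes a few lines of algebra even in this trivial case; what the paper offers is a way to get analogous error bounds mechanically for much more complicated plans.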
This work feels timely: as data volumes grow, it becomes ever more important to be able to sample, and to sample confidently.
There is a trade-off here: the generality of GUS also makes it harder to reason about, and I certainly had trouble understanding what everything meant.
I can see this being influential in 10 years: the authors provide a general-purpose tool for others to use in a space that will only become more important.