Fractal Mining:
Using fractals for knowledge discovery in large datasets.
MOTIVATION
The need for scalable and incremental data mining tools have increased
along with the ability of collecting and storing more and more data in
organizations.
In this project we investigate a new paradigm for data mining: the use
of fractals as a tool to extract knowledge from datasets. Fractal sets
have the property of being self-similar, that is, invariant
with respect to the scale used to look at them.
Nature is filled with examples of sets that, although seemingly chaotic
in behavior, exhibit self-similarity. Fractals are characterized
by their fractal dimension, which roughly speaking, characterizes
the number of dimensions ``filled'' by the object represented by the
datasets. The conjecture we exploit in this project to capture knowledge
in datasets is that if a part of
a dataset brings about a change in the overall fractal dimension, this
part is ``anomalous'' with respect to the rest of the set.
We will use this conjecture to research and develop techniques for
knowledge discovery. Concretely, we aim to:
- Research and develop an new algorithm for data clustering, based on
the concept of fractal dimension. This algorithm will be scalable with
the size of the dataset and number of dimensions, and will be resilient
to noise.
- Use the fractal dimension as a measure to compute changes in dataset behavior.
This has two important ramifications: quantify how well a clustering model built
for a dataset fits another one, and quantify the difference between two different
clustering models. The tools derived from these techniques will be useful
for analysts who want to monitor how a dataset evolves in time (e.g., weekly sales
of an organization), or want to compare how datasets differ across parts of the
organization.
- Use the fractal dimension as a dissimilarity function in algorithms designed
to detect deviations in datasets. We believe that the fractal dimension is a natural
way to quantify deviations between sets and thus can replace other functions in
known algorithms for capturing deviations, improving the quality of their results.
- Use the fractal dimension to implement an algorithm that uncovers the
behavior of association rules in the time dimension. For instance, one would
be interested in finding out if the common occurence of an itemset such as
{beer, diapers} is clustered in some region of time such as the evening, or
if it exhibits a self-similar behavior: i.e., occurs with the same distribution
regardless of the scale used.
- Use the clustering and data deviation techniques described before to discover
patterns of special values in datacubes (such as null cells and cells with large
values). This technique will be able to uncover important knowledge, such as
clusters of stores that fail to sell products and so on.
Concepts, demo
A fractal image generated by Fractint
|
Luray Caverns, Luray, Virginia
|
PUBLICATIONS
- Chaotic Mining: Knowledge Discovery Using the Fractal Dimension,
Proceedings of the 1999 SIGMOD DMKD Workshop, Philadelphia, PA,
June 1999. (Postcript)
- Using the fractal dimension to cluster datasets, Proceedings
of the 2000 International Conference in Knowledge Discovery and Data
Mining (ACM-SIGKDD), D. Barbará and P. Chen
(pdf)
- Tracking Clusters in Evolving Data Sets, Proceedings
of the 14th Int'l Florida Artificial Intelligence Conference
D. Barbará and P. Chen pdf
- Fractal mining of interval association rules, Tech report
(pdf)
|
PEOPLE
|
Back to Daniel Barbará's page