Fractal Mining: Using fractals for knowledge discovery in large datasets.

    MOTIVATION
    The need for scalable and incremental data mining tools have increased along with the ability of collecting and storing more and more data in organizations. In this project we investigate a new paradigm for data mining: the use of fractals as a tool to extract knowledge from datasets. Fractal sets have the property of being self-similar, that is, invariant with respect to the scale used to look at them. Nature is filled with examples of sets that, although seemingly chaotic in behavior, exhibit self-similarity. Fractals are characterized by their fractal dimension, which roughly speaking, characterizes the number of dimensions ``filled'' by the object represented by the datasets. The conjecture we exploit in this project to capture knowledge in datasets is that if a part of a dataset brings about a change in the overall fractal dimension, this part is ``anomalous'' with respect to the rest of the set. We will use this conjecture to research and develop techniques for knowledge discovery. Concretely, we aim to:
    • Research and develop an new algorithm for data clustering, based on the concept of fractal dimension. This algorithm will be scalable with the size of the dataset and number of dimensions, and will be resilient to noise.
    • Use the fractal dimension as a measure to compute changes in dataset behavior. This has two important ramifications: quantify how well a clustering model built for a dataset fits another one, and quantify the difference between two different clustering models. The tools derived from these techniques will be useful for analysts who want to monitor how a dataset evolves in time (e.g., weekly sales of an organization), or want to compare how datasets differ across parts of the organization.
    • Use the fractal dimension as a dissimilarity function in algorithms designed to detect deviations in datasets. We believe that the fractal dimension is a natural way to quantify deviations between sets and thus can replace other functions in known algorithms for capturing deviations, improving the quality of their results.
    • Use the fractal dimension to implement an algorithm that uncovers the behavior of association rules in the time dimension. For instance, one would be interested in finding out if the common occurence of an itemset such as {beer, diapers} is clustered in some region of time such as the evening, or if it exhibits a self-similar behavior: i.e., occurs with the same distribution regardless of the scale used.
    • Use the clustering and data deviation techniques described before to discover patterns of special values in datacubes (such as null cells and cells with large values). This technique will be able to uncover important knowledge, such as clusters of stores that fail to sell products and so on.
    Concepts, demo

    A fractal image generated by Fractint

    Luray Caverns, Luray, Virginia
    PUBLICATIONS
    • Chaotic Mining: Knowledge Discovery Using the Fractal Dimension, Proceedings of the 1999 SIGMOD DMKD Workshop, Philadelphia, PA, June 1999. (Postcript)
    • Using the fractal dimension to cluster datasets, Proceedings
      of the 2000 International Conference in Knowledge Discovery and Data
      Mining (ACM-SIGKDD), D. Barbará and P. Chen
      (pdf)
    • Tracking Clusters in Evolving Data Sets, Proceedings
      of the 14th Int'l Florida Artificial Intelligence Conference
      D. Barbará and P. Chen pdf
    • Fractal mining of interval association rules, Tech report (pdf)
    PEOPLE

Back to Daniel Barbará's page