Skip to content

bnegreve/paraminer

Repository files navigation

ParaMiner is a generic, parallel algorithm for mining closed frequent patterns in transactional datasets.

The generic base of ParaMiner can be instanciated to solve different mining task. This version of the code comes with 3 instances of ParaMiner:

  • ParaMinerFIM to mine frequent closed itemsets,
  • ParaMinerCRG to mine closed relational graphs, and
  • ParaMinerGRI to mine graduals patterns.

See Section Built-in instances of ParaMiner for more details about ParaMiner existing instances.

Quick start

<<Quick start>>

Download ParaMiner

here: http://www.lamsade.dauphine.fr/~bnegrevergne/webpage/software/paraminer/paraminer-1.0.tar.gz (Or fetch it on github: https://github.com/bnegreve/paraminer)

Extract the archive

tar xzvf paraminer-1.0.tgz

Build ParaMiner

cd paraminer
./configure 
make
make install 

Default install directory is /usr/local. You can specify a different install directory with –prefix:

./configure --prefix=<path/to/install_dir>

If you do not wish to install, ParaMiner executable files can be found in the <paraminer>/src/ directory.

Run ParaMiner

You can now run a built-in instance to solve a standard pattern mining problem (see Section Built-in instances) or you can start writing your own instance of ParaMiner (see Section Creating an instance).

For example: the fim instance can be used to mine closed frequent itemset in a transactional dataset.

paraminer_itemsets data/fim/mushroom.dat 1500 -t 4

Will output all the closed frequent itemsets occurring in mushroom dataset with an absolute frequency of at least 1500. The -t switch specifies the number of threads. See paraminer_itemsets -h for further details.

Built-in instances of ParaMiner

<<Built-in instances>> In order to solve a pattern mining problem, you need the adequate instance of ParaMiner.

This version of ParaMiner comes with 3 instances:

  • ParaMinerFIM to mine frequent closed itemsets,
  • ParaMinerCRG to mine closed relational graphs, and
  • ParaMinerGRI to mine graduals patterns.

ParaMinerFIM for closed frequent itemset mining

<<ParaMinerFIM>>

Extract closed frequent itemset from a transactional dataset.

Input file format

<<fim input file format>>

The input file format is the standard ASCII format of the FIMI dataset repository:

  • The whole dataset is described by a single ASCII file;
  • each line is a transaction;
  • each transaction contains distinct items;
  • transactions must be ordered;
  • last line must be empty.

e.g. test.dat

1 2 4 6
1 2 3 5 7
2 3

is a valid dataset.

Running the ParaMinerFIM

You can mine closed frequent itemsets occurring in this dataset by executing the following command:

./paraminer_itemsets test.dat 2

Alternatively, if you have a multi-core/parallel computer with 8 cores, you can exploit them by executing the following command:

paraminer_itemsets test.dat 2 -t 8

Output format

  • each line is a frequent closed itemset;
  • frequency is stored at the end of the line into brackets.

e.g.

./paraminer_itemsets test.dat 2 

will generate the following output on the standard output:

2 (3)
1 2 (2)
3 2 (2)

The results can be stored by redirecting the standard output into a file:

./paraminer_itemsets test.dat 2 -t 1 > results.out

ParaMinerCRG for closed frequent connected relational graphs mining

<<ParaMinerCRG>>

Extract connected relational graphs from relational graphs datasets. Relational graphs (graphs with distinct labels)

Input format

A graph dataset is a directory containing a collection of ASCII files. Each ASCII file is the description of one graph from the dataset.

The files must have the following format:

  • the first line is the number of distinct vertexes labels in the graph dataset;
  • each following line is a triplet <vertex id> <vertex id> <edge value> describing one edge of the graph where:

<vertex id> are integer identifiers for the two vertexes of the edge. <edge value> is any real number. Edges with a value bellow the edge threshold (mandatory argument of ParaMinerCRG) are disregarded. This is typically used to simplify the overly complex graphs before the mining process. If unnecessary, use <edge value> = 1 for every edge.

For example:

10 
1 2 1 
2 3 1 
3 4 1 

Describes the following relational graph:

(2) -- (1)
 |
 |
(3) -- (4)

An graph dataset example can be found in <paraminer_directory>/data/crg/test.

Running ParaMinerCRG

You can mine closed connected relational graphs occurring in the example graph dataset by executing the following command:

./paraminer_cgraphs  <paraminer_directory>/data/crg/test 1 1

Output format

  • Each line is a list of edges that representing a connected subgraph that is frequent in the dataset.
  • The line ends with the frequency of the graph.

For example, mining the example dataset will generate the following outpout.

( 1, 2 ) (2)
( 3, 4 ) (2)
( 1, 2 ) ( 1, 4 ) ( 2, 2 ) ( 3, 4 ) (1)
( 1, 2 ) ( 3, 4 ) ( 2, 3 ) (1)
4 patterns mined

ParaMinerGRI for gradual pattern mining

<<ParaMinerGRI>> See [ 7 ] for more information about gradual patterns.

Creating a new instance of ParaMiner

<<Creating an instance>>

This section describe how to create your own instance of ParaMiner. You need to create a new instance if you want to mine a type of patterns that is not supported by any ParaMiner built-in instance.

For example let’s say we want to mine periodic patterns, which is not supported by default in ParaMiner.

First start by creating a paraminer_local_periodic.cpp file which will contain an implementation of the following C++ functions:

A selection criterion

In a function called membership_oracle(). The selection criterion to distingish candidate patterns from patterns.

It takes as an argument a closed pattern P and a possible augmentation element e. It must return a non-null value if and only if the candidate pattern P U {e} is a pattern.

For example for our closed dark pattern mining problem, it can be as simple as:

bool membership_oracle(P, e){
  return is_a_periodic_pattern(P U {e}); 
}

A closure operator

In a function called clo()

The closure operator can be used to limit the redundancy in the resulting set of Patterns. Takes a pattern as an argument, and returns a closed pattern. The identity function is a valid closure operator.

This function as to be a valid closure operator

clo(P){
  return P;
}

It is worth noting that ParaMiner’s efficiency relies on closed pattern. Therefore defining a closure operator according to the problem definition is usually a good idea. Many example of closure operators have been proposed in [ 2 ]. If your problem satisfies some properties a default closure operator (better than the identity) can be used. A section is dedicated to this in [ 1 ].

A main function

The main function is here to achieves three goals:

  1. Parse the command line arguments
  2. Load and pre-process the dataset
  3. Invoque the clogen() routine to start the exploration.

Parsing the command line arguments

You must start your main function by calling the parse_clogen_arguments(argc, argv) function. It will capture the arguments used by ParaMiner remove them from argv and decrease argc.

Loading the dataset

The dataset must be loaded into a table called tt which is of type TransactionTable.

If your dataset is stored as described in fim input file format, you can use the built-in function read_transaction_table() It takes two argument, the filename and the transaction table.

So far our clogen_local_dark.cpp file looks like this:

int main(int argc, char **argv){

load_transaction_table (&tt, argv[1])

...

}

Invoking the search space exploration

Once your dataset is loaded into tt, you must call the clogen() main routine with empty_set as an argument if you want to start the exploration from the emptyset.

Bugs and bug reports

Repport bugs and/or comments at: [email protected]

My FirstName is Benjamin My LastName is Negrevergne

Publications

<<Refs>>

Main publication:

(If you use ParaMiner for your your research, please cite this publication.)

[ 1 ] ParaMiner: A generic pattern mining algorithm for multi-core architectures [to appear] Benjamin Negrevergne · Alexandre Termier · Marie-Christine Rousset and Jean-François Méhaut DAMI/DMKD

Other important reads

[ 2 ] Arimura, H., & Uno, T. (2005). A polynomial space and polynomial delay algorithm for enumeration of maximal motifs in a sequence. Algorithms and Computation, 724-737.

[ 3 ] Boley, M., Horváth, T., Poigné, A., & Wrobel, S. (2010). Listing closed sets of strongly accessible set systems with applications to data mining. Theoretical computer science, 411(3), 691-700.

[ 4 ] Benjamin Negrevergne. A Generic and Parallel Pattern Mining Algorithm for Multi-Core Architectures. PhD thesis, Grenoble University, 2011.

[ 5 ] Uno, T., Kiyomi, M., & Arimura, H. (2004, November). LCM ver. 2: Efficient mining algorithms for frequent/closed/maximal itemsets. In Proceedings of the IEEE ICDM Workshop on Frequent Itemset Mining Implementations (FIMI 04).

[ 6 ] Negrevergne, B., Termier, A., Méhaut, J., & Uno, T. (2010, June). Discovering closed frequent itemsets on multicore: Parallelizing computations and optimizing memory accesses. In High Performance Computing and Simulation (HPCS), 2010 International Conference on (pp. 521-528). IEEE.

Gradual itemset mining

[ 7 ] Anne Laurent, Benjamin Négrevergne, Nicolas Sicard, and Alexandre Termier. Pgp-mc: Towards a multicore parallel approach for mining gradual patterns. In DASFAA, pages 78-84, 2010.

Authors and license

<<Authors>>

Authors:

  • Benjamin Negrevergne
  • Alexandre Termier

It was developped at Grenoble University / LIG.

License: ParaMiner is distributed under the LGPLv3 See LICENSE file in source directory for more informations.

About

generic algorithm for frequent pattern mining

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published