The current stable version can be installed by the following command:
pip install densired
Use the following code to generate a skeleton. Parameters are listed below.
skeleton = datagen.densityDataGen()
Use the following code to obtain a dataset with n points from a skeleton. There are additional parameters, that are also listed below.
data = skeleton.generate_data(n)
datax = data[:,0:-1]
datay = data[:,-1]
To visualize the skeleton, call the following code. For higher-dimensionalities, either specify dcount to get all pairs of the dc most spread out dimensions or specify the desired dimensions directly with dims.
skeleton.display_cores(dims=[d1,d2,...], dcount=dc)
To visualize a dataset, call the following. Give the dataset as data. The flags show_radius and show_core decide whether to display the core radii and core centers, respectively. For higher-dimensionalities, as with dispaly_cores, either specify dcount to get all pairs of the dc most spread out dimensions or specify the desired dimensions directly with dims.
skeleton.display_data(data, show_radius=False, show_core=False, dims=[d1,d2,...], dcount=dc)
To initialize a stream, call the following function. The command-String controls the stream behavior. The default_duration is the default duration of a block of the stream. default_duration does not need to be specified, in which case it has a value of 1000. The command-String will be explained in more detail further below.
skeleton.init_stream(command=commandstring, default_duration = 1000)
In order to get an element from the stream, just use the skeleton as an iterator
x = skeleton.next()
To visualize a data stream, call the following.
skeleton.display_current_stream()
Alternatively, use this to set a stream command and display it in one command. Parameters are analogous to init_stream.
skeleton.display_stream(command=commandstring, default_duration = 1000)
Parameter | Abrev. | Default | Role |
---|---|---|---|
dim | dim | dimensionality of the data set | |
clunum | k | number of clusters | |
core_num | crn | number of cores across all clusters allows for explicit specification of core number per cluster | |
domain_size | ds | initial range of the starting position of the clusters | |
radius | r | base range around core center where points can be generated | |
step | base distance between cores | ||
dens_factors | cf | False | cluster density/scale factorsmultiplies both step size and radius for cluster assigns all clusters a density factor assign random dens_factors cf can also be set per cluster in a list |
momentum | cluster momentum factors assigns the value to all clusters as stickiness factor assigns sets random stickiness factors between can also be set per cluster in a list |
||
min_dist | o | overlap factor on the minimal distance between cores (sum of core sizes) | |
step_spread | w | width of normal distribution on shift | |
pos | p | starting core positions for clusters, unspecified if None, needs to be set per cluster in a list if the cluster could not be created for the specified position, the regular starting rules are used instead |
|
distributions | dist | distributions for random generation of data points for each core None or "uniform" uses uniform distribution within hypersphere "paper" is the distribution used for the paper at ECML PKDD 2024 and up to version 1.1.0 "gaussian" uses a multivariate Gaussian (covariance matrix being a diagonal matrix using the core radius) "studentt" uses multivariate Student's t distribution with degrees of freedom 2 if a number is given, uses multivariate Student's t distribution with degrees of freedom corresponding to the given number (shape being the core radius) can also be set per cluster in a list Note: using anything other than uniform or paper can lead to the data not being density-separable, radii and min_dists may need to be adjusted |
|
max_retry | mr | maximal number of attempts for generation before generation enters the failure handling | |
branch | chance of creating a branch from the prior scaffolding of the cluster done by restarting from a randomly selected core of the current cluster |
||
star | chance of the initial starting core being chosen for any attempt of restarting applies to both failure and branching remaining cores all have an equal probability |
||
seed | i | seed for random operations | |
connections | cc | number of connections randomly picks the specified number of connection pairs allows for explicit specification of desired connections if a list is given connections have to specified as "startid;endid" only guarantees number of attempted connections, not final number |
|
con_radius |
r |
analogous to radius, used for connections | |
con_step | analogous to step, used for connections | ||
con_dens_factors |
cf |
False | analogous to dens_factors, used for connections |
con_momentum | analogous to momentum, used for connections | ||
con_min_dist |
o |
factor on the minimal distance between a connection and other cores | |
con_distribution |
dist |
distributions for random generation of data points for the cores of the connections | |
clu_ratios | ra | None | distribution of data points across clusters |
min_ratio |
ra |
minimal ratio of cores/points (only used when these are randomly determined through ra, should be |
|
ratio_noise |
ra |
ratio of noise data points | |
ratio_con |
ra |
ratio of connection data points | |
square | False | generate noise in square data space | |
random_start | rs | False | whether to start at a random position (True) or to start between two different clusters’ random cores (False) |
verbose | v | False | Generate text outputs at key events |
safety | s | True | forces noise generation outside of two times core radius |
Parameter | Abrev. | Default | Role |
---|---|---|---|
data_num | n | number of data points | |
non_zero | d_nz | False | guarantee at least one point per core |
center | d_c | False | place first data point of core at center of core |
equal | d_e | False | attempts to distribute points as equally as possible across cluster cores of the same cluster |
seed | d_i | None | updates generator seed if None, maintains current data/skeleton generator seed |
The core skeleton provided by DENSIRED allows for a setup for the stream data and offers user control over changes on a cluster and even a sub-cluster level. A simple continuous random sampling of new data points from all clusters according to the specified data ratios is, of course, possible. However, DENSIRED offers the user the ability to specify which clusters and even which cluster core to draw from.
To do so, the user has to specify a string parameter that describes the desired behavior of the data stream. The generator supports one stream at a time and can be called as an iterator to get the next value from the stream. If the specification string is changed, the stream is reset back to the start.
The stream is split into specific blocks. These blocks are separated by "|" in the string. The stream iterator remains inside a block for a set amount of data points (either based on a default value or a specified number). The number of points to generate is specified by using a numeric value đť‘Ą between braces "{x}". Each block has a user-defined set of clusters and cores that the new data point can be drawn from.
The clusters are set by their numeric identifier, separated from other clusters by "#". Within each cluster, individual cores can be specified. This is done by denoting the core identifiers inside square brackets "[x]". This supports a comma-separated list of individual identifiers, an end-inclusive range of identifiers denoted by a colon, as well as a combination of both.
By using regular brackets, it is possible to denote repetitions as well. These can be nested and their repetition number is designated with braces as with the point numbers for individual blocks. If no number is specified, the repetition count defaults to repeating the loop content once, meaning that the sequence will occur twice. The exception to this is when the repetition without a repetition number is the final part of the string. In that case, the loop is instead repeated endlessly. If there is still a block at the end of the string after the last repetition, the stream stays in that block. Should the final element be a repetition with a fixed count, the entire stream sequence is looped instead to maintain the repetition number.
Points are generated from the cores of the current block. The weights associated are drawn from the given parameters. The noise ratio is maintained if is added to a block to designate the section as a noisy one.
If a connection is requested by calling its identifier, it also follows the connection ratio, with the ratio being split between all connections based on their core numbers during the block. The clusters are weighted according to the cluster ratios when disregarding any cluster that does not have a core in the block. There is no reduced weight for clusters with fewer cores.
To visualize the stream and to easily make adjustments, it is possible to request the data generator to display the behavior for the stream, which will provide a visualization of the skeleton of the block in each overall stream step. It is especially advisable to display the stream behavior before generating the data when individual cores were chosen to ensure the desired cores were selected
The 'stretched' datasets (datasets/high_data_{dim}.npy), which were used in the paper, were generated using the following code. The results for the various algorithms on the 'stretched' datasets used for Table 3 and Figure 3 (and Figures 2,3 and 4 as well as Table 1 of the supplementary material) can be found in results/high_data_results.
for dim in [2,5,10,50,100]:
x = datagen.densityDataGen(dim=dim, ratio_noise = 0.1, max_retry=5, dens_factors=[1,1,0.5, 0.3, 2, 1.2, 0.9, 0.6, 1.4, 1.1], square=True,
clunum= 10, seed = 6, core_num= 200, momentum=0.8, step=1.5,
branch=0.1, star=1, verbose=False, safety=False, domain_size = 20, random_start=False)
data = x.generate_data(5000)
datax = data[:,0:-1]
max = np.max(datax) - np.min(datax)
for d in range(len(datax[0])):
datax[:,d] = (datax[:,d] - np.min(datax[:,d]))/max
datay = data[:,-1]
The results of the intrinsic dimensionality benchmarking used for Fig.5 are in intrinsic_dim_100_500_1000_merged.csv. The data for it was generated by the following code:
def _norm_data(dataset):
dataset = dataset - np.min(dataset)
dataset = dataset / np.max(dataset)
return dataset
branch = 0.05
star = 0
shift = 1
stickiness = 0.5
for seed in [0,1,2,3,4,5,6,7,8,9]:
for core_num in [100, 500, 1000]:
parameter_for_loop:
x = datagen.densityDataGen(dim=dim, ratio_noise=0, max_retry=5, dens_factors=1, square=True,
clunum=1, seed=seed, core_num=core_num,
star=star, momentum=momentum, branch=branch, step=step,
verbose=False, safety=False, domain_size=20, random_start=False)
data = x.generate_data(5000*core_num/100)
datax = data[:, 0:-1]
datax = _norm_data(datax)
pca = PCA().fit(datax)
cumsum = np.cumsum(pca.explained_variance_ratio_)
int_dim90 = np.argwhere(cumsum >= 0.9)[0][0] + 1
int_dim95 = np.argwhere(cumsum >= 0.95)[0][0] + 1
int_dim99 = np.argwhere(cumsum >= 0.99)[0][0] + 1
int_dim75 = np.argwhere(cumsum >= 0.75)[0][0] + 1
int_dim50 = np.argwhere(cumsum >= 0.50)[0][0] + 1
int_dim25 = np.argwhere(cumsum >= 0.25)[0][0] + 1
Here the 'parameter_for_loop' was one of the following for the four evaluated parameters:
for branch in [0, 0.0001, 0.001, 0.01, 0.025, 0.05, 0.1, 0.15, 0.2, 0.25, 0.3, 0.35, 0.4, 0.45, 0.5]:
for momentum in [0, 0.05, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 0.92, 0.95, 0.99, 1]:
for star in [0, 0.2, 0.4, 0.6, 0.8, 1]:
for step in [1, 1.25, 1.5, 1.75, 2]:
The "mix"-setting 2-d evaluation dataset (datasets/new_data_scaled_2.npy) used for Figures 2,3 and 6 of the supplementary material was generated with the following code. An unscaled version from directly after the generate_data call is included as datasets/new_data_2.npy.
x = datagen.densityDataGen(dim=2, ratio_noise = 0.1, max_retry=5, dens_factors=[1,1,0.5, 0.3, 2, 1.2, 0.9, 0.6, 1.4, 1.1], square=True,
clunum= 10, seed = 6, core_num= 200, momentum=[0.5, 0.75, 0.8, 0.3, 0.5, 0.4, 0.2, 0.6, 0.45, 0.7],
branch=[0,0.05, 0.1, 0, 0, 0.1, 0.02, 0, 0, 0.25],
con_min_dist=0.8, verbose=True, safety=True, domain_size = 20, random_start=False)
data = x.generate_data(5000)
datax = data[:,0:-1]
max = np.max(datax) - np.min(datax)
for d in range(len(datax[0])):
datax[:,d] = (datax[:,d] - np.min(datax[:,d]))/max
datay = data[:,-1]
The algorithm results for this setting can be found in the folder results/new_data_scaled_results. The .csv files contain the actual performance results for the algorithms. The best_labels.npy files contain the labels of the best-performing clustering for the respective seeds for each algorithm. The best performance was considered to be the highest sum of the NMI and ARI. In the case of ties, the earlier label set was kept.
The datasets of the 'compact'-setting (datasets/low_data_{dim}.npy) used for Figures 2,3 and 5 and Table 3 of the supplementary material were generated using the following code. The results for the various algorithms on the 'compact' datasets can be found in results/low_data_results.
for dim in [2,5,10,50,100]:
x = datagen.densityDataGen(dim=dim, ratio_noise = 0.1, max_retry=5, dens_factors=[1,1,0.5, 0.3, 2, 1.2, 0.9, 0.6, 1.4, 1.1], square=True,
clunum= 10, seed = 6, core_num= 200, momentum=0, step=1,
branch=0, star=0, verbose=False, safety=False, domain_size = 20, random_start=False)
data = x.generate_data(5000)
datax = data[:,0:-1]
max = np.max(datax) - np.min(datax)
for d in range(len(datax[0])):
datax[:,d] = (datax[:,d] - np.min(datax[:,d]))/max
datay = data[:,-1]
The seed_spreader code is based on the description provided in Junhao Gan and Yufei Tao. "DBSCAN revisited: Mis-claim, un-fixability, and approximation.", Proceedings of the 2015 ACM SIGMOD international conference on management of data. 2015, as well as Junhao Gan and Yufei Tao. "On the Hardness and Approximation of Euclidean DBSCAN", ACM Transactions on Database Systems (TODS), vol. 42, no. 3, pp. 1–45, 2017. The var_density flag swaps the generator to the Varying-density setting described in the 2017 paper.
It can be used like this:
Parameter | Abrev. | Default | Role |
---|---|---|---|
n | n | 2000000 | number of data points |
dim | d | 5 | dimensionality |
ratio_noise |
|
0.001 | noise ratio |
domain_size | domain size | ||
reset_counter | 100 | local counter for hypersphere points | |
restart_chance_mult | 0.001 | cluster restart overall roughly the desired cluster number, though not the actual cluster number corresponds to |
|
radius | 100 | radius of hypersphere | |
step | False | overwrite value for the step size if set to False the step size ( |
|
noise_adapt | False | adapt noise to altered domain size after random walk (not part of the description provided by Junhao Gan and Yufei Tao) | |
var_density | False | enables density-based mode introduced in "On the Hardness and Approximation of Euclidean DBSCAN" | |
seed | 0 | seed of randomness | |
verbose | False | Generate text outputs at key events |
The density variance is set to
data = seedSpreader(dim=2, noise_adapt=True, var_density = True)
A requirements file for datagen.py and seed_spreader.py has been generated through pip freeze and is provided here. The used Python version was Python 3.9. Other python versions may need different versions of the packages. The required packages are:
numpy
matplotlib
If you use our code in your research or applications, please consider citing our paper.
Due to a bug in the implementation, the distributions used for data generation in the paper is not uniform, but has some bias towards the center of the core.
This bug has since been resolved, but the old distribution can be used by setting paper
as the parameter value for distributions or con_distribution.
@inproceedings{jahn24densired,
author = {Philipp Jahn and
Christian M. M. Frey and
Anna Beer and
Collin Leiber and
Thomas Seidl},
title = {Data with Density-Based Clusters: {A} Generator for Systematic Evaluation
of Clustering Algorithms},
booktitle = {{ECML/PKDD} {(7)}},
series = {Lecture Notes in Computer Science},
volume = {14947},
pages = {3--21},
publisher = {Springer},
year = {2024}
}