Issues with CluStream clustering #1633
PhilJahn
started this conversation in
Show and tell
Replies: 1 comment 2 replies
Good stuff, @PhilJahn! I am pinging @hoanganhngo610 on this one :)
Hello, I was working with CluStream together with my colleague yesterday, and we noticed that the CluStream cluster assignments were inconsistent with the behavior of kMeans. An example on the complex-9 dataset is shown below. Grey dots with orange rims are the micro-cluster centers (based on micro_clusters[index].center), and colored dots with red rims are the respective kMeans centroids. The inconsistency is most noticeable in the out-of-place light green data points in the top left and the turquoise ones in the bottom center-left.
![grafik](https://private-user-images.githubusercontent.com/25903816/386634426-4e9e0549-74b9-47e0-b2ab-2a3cbab0ee3b.png?jwt=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3MzkyNDMxNTEsIm5iZiI6MTczOTI0Mjg1MSwicGF0aCI6Ii8yNTkwMzgxNi8zODY2MzQ0MjYtNGU5ZTA1NDktNzRiOS00N2UwLWIyYWItMmEzY2JhYjBlZTNiLnBuZz9YLUFtei1BbGdvcml0aG09QVdTNC1ITUFDLVNIQTI1NiZYLUFtei1DcmVkZW50aWFsPUFLSUFWQ09EWUxTQTUzUFFLNFpBJTJGMjAyNTAyMTElMkZ1cy1lYXN0LTElMkZzMyUyRmF3czRfcmVxdWVzdCZYLUFtei1EYXRlPTIwMjUwMjExVDAzMDA1MVomWC1BbXotRXhwaXJlcz0zMDAmWC1BbXotU2lnbmF0dXJlPWNlNjU3NDM0ZTQyZmI1ZjI5ZjA3NmZlZThlZTIxOGNjMDcxMzA4ODAxZDY4NWQ2Zjk2YzllYjg0MzczMWQyMzkmWC1BbXotU2lnbmVkSGVhZGVycz1ob3N0In0.Y6EeAOIp2Bm9qPmwIW0w6MSAFv2pexoNleOiUG3lH44)
![grafik](https://private-user-images.githubusercontent.com/25903816/386636601-fc4bd7f5-d567-4197-8f1c-8c30d1f84772.png?jwt=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3MzkyNDMxNTEsIm5iZiI6MTczOTI0Mjg1MSwicGF0aCI6Ii8yNTkwMzgxNi8zODY2MzY2MDEtZmM0YmQ3ZjUtZDU2Ny00MTk3LThmMWMtOGMzMGQxZjg0NzcyLnBuZz9YLUFtei1BbGdvcml0aG09QVdTNC1ITUFDLVNIQTI1NiZYLUFtei1DcmVkZW50aWFsPUFLSUFWQ09EWUxTQTUzUFFLNFpBJTJGMjAyNTAyMTElMkZ1cy1lYXN0LTElMkZzMyUyRmF3czRfcmVxdWVzdCZYLUFtei1EYXRlPTIwMjUwMjExVDAzMDA1MVomWC1BbXotRXhwaXJlcz0zMDAmWC1BbXotU2lnbmF0dXJlPTJkZDRjNzhmNzE4Y2Y3ZTE0NTIyM2UyZDNlMzFjNWUzZWFkMzllODUxN2QwM2FkODJlMTljZTliZjZkYWVmMzQmWC1BbXotU2lnbmVkSGVhZGVycz1ob3N0In0.RO-XaTlZ2eIG8PFmOil0XnMEDcn2N4RVB5Bz2P6R_t8)
While investigating this, I noticed a mismatch between the micro-cluster centers used for finding the nearest micro-cluster (self.micro_clusters[index].center) and those used for the cluster assignment (self._mc_centers[index]). The latter are only updated when kMeans clustering is performed, whereas the former are kept up-to-date. This causes problems when a micro-cluster is deleted or merged: the index of the nearest micro-cluster may then point to the stored center of the deleted or merged micro-cluster, which can lie at a very different position and thus belong to a different kMeans cluster. The same applies, to a lesser degree, when a micro-cluster moves between kMeans steps.
This can be fixed by using self.micro_clusters[index].center for the cluster assignment.
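To make the mismatch concrete, here is a minimal toy sketch. It is not river's actual code: the MicroCluster class, the dictionaries, and the helper functions are hypothetical stand-ins that only mirror the two attributes named above. It shows how a snapshot of centers that is not refreshed after a micro-cluster is replaced can send a point to a different macro-cluster than the live center would.

```python
# Toy illustration only: the names mirror the post, not river's internals.
from dataclasses import dataclass


@dataclass
class MicroCluster:
    center: tuple


def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(point, centers):
    """Return the key of the center closest to point."""
    return min(centers, key=lambda k: sq_dist(point, centers[k]))


# Live micro-clusters, plus a snapshot taken at the last kMeans step.
micro_clusters = {0: MicroCluster((0.0, 0.0)), 1: MicroCluster((10.0, 10.0))}
mc_centers_snapshot = {i: mc.center for i, mc in micro_clusters.items()}
macro_centroids = {0: (0.0, 0.0), 1: (10.0, 10.0)}

# Index 1 is reused: the old micro-cluster is deleted and a new one is
# created near the origin, but the snapshot is not refreshed.
micro_clusters[1] = MicroCluster((1.0, 1.0))

x = (1.2, 0.9)
idx = nearest(x, {i: mc.center for i, mc in micro_clusters.items()})

# Assignment through the stale snapshot vs. through the live center.
stale_label = nearest(mc_centers_snapshot[idx], macro_centroids)
live_label = nearest(micro_clusters[idx].center, macro_centroids)
print(idx, stale_label, live_label)  # 1 1 0
```

The point sits next to the origin, yet the stale lookup places it in the macro-cluster at (10, 10), which is the kind of out-of-place assignment visible in the plots.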
I also noticed a second, presumably unintended, behavior: the kMeans step is skipped if the data point that arrives at the timestamp satisfying the "% self.time_gap == self.time_gap - 1" condition is added to an existing micro-cluster. As a result, the clustering can be outdated or, in the worst case, never performed at all.
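The following toy simulation sketches one way such a skip can arise, assuming the learning step returns early once a point is absorbed into an existing micro-cluster. The run function and its check_before_return switch are hypothetical and only model the control flow, not river's implementation.

```python
def run(absorbed_flags, time_gap, check_before_return):
    """Return the timestamps at which the simulated kMeans step runs.

    absorbed_flags[t] is True when the point arriving at timestamp t
    is absorbed by an existing micro-cluster.
    """
    ran_at = []

    def maybe_recluster(t):
        # Same trigger condition as quoted in the post.
        if t % time_gap == time_gap - 1:
            ran_at.append(t)

    for t, absorbed in enumerate(absorbed_flags):
        if absorbed:
            if check_before_return:
                maybe_recluster(t)
            continue  # stands in for an early return from the learning step
        # ... here a micro-cluster would be deleted or merged to make room ...
        maybe_recluster(t)
    return ran_at


# The points at t = 4 and t = 9 hit the trigger, and both happen to be absorbed.
flags = [False, False, False, False, True, False, False, False, False, True]
skipped = run(flags, time_gap=5, check_before_return=False)
fixed = run(flags, time_gap=5, check_before_return=True)
print(skipped)  # []
print(fixed)    # [4, 9]
```

With the check placed only after the early exit, the clustering never runs on this stream; evaluating the trigger before returning restores both scheduled runs.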
I have addressed both in my fork of the repository: https://github.com/PhilJahn/PhilJahnLMU-river-fork. I have opened up a pull request for this here: #1634
Related to this, I also want to suggest giving the option to perform the kMeans clustering as part of the prediction phase, as this seems to be closer to the idea of Online-Offline Stream clustering. This would also resolve the first issue above (at least when the option is used) as it would also keep _mc_centers[index] up-to-date regarding the predictions. I have also implemented this in a branch of my fork of the river-repository (https://github.com/PhilJahn/PhilJahnLMU-river-fork/tree/Offline-CluStream), but have not done documentation for it as I wanted to wait for feedback regarding the suggestion before doing so.
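As a rough sketch of the suggested option, the toy class below runs its offline step lazily, at prediction time, and only when the online summary has changed since the last run. Everything here is an assumption made for illustration: LazyMacroClusterer, its recluster callback, and the 1-D threshold that stands in for kMeans are hypothetical and are not taken from river or from the fork.

```python
class LazyMacroClusterer:
    """Toy online-offline split: summarise online, cluster on demand."""

    def __init__(self, recluster):
        self._recluster = recluster  # hypothetical offline step: centers -> labels
        self._centers = []           # stand-in for micro-cluster centers
        self._labels = None          # cached macro label for each center
        self.offline_runs = 0

    def learn_one(self, center):
        self._centers.append(center)
        self._labels = None          # any update invalidates the cached labels

    def predict_one(self, x):
        if self._labels is None:     # run the offline phase only when needed
            self._labels = self._recluster(self._centers)
            self.offline_runs += 1
        nearest = min(
            range(len(self._centers)),
            key=lambda i: abs(self._centers[i] - x),
        )
        return self._labels[nearest]


# 1-D toy data; the "offline step" is a fixed threshold rather than kMeans.
model = LazyMacroClusterer(lambda cs: [int(c > 5) for c in cs])
for c in (0.0, 1.0, 9.0, 10.0):
    model.learn_one(c)
labels = [model.predict_one(x) for x in (0.4, 9.6, 1.1)]
print(labels, model.offline_runs)  # [0, 1, 0] 1
```

Because the labels are always computed from the current centers, predictions cannot rely on an outdated snapshot, and consecutive predictions without intervening updates reuse the cached result.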