-
Notifications
You must be signed in to change notification settings - Fork 272
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Cleaning for biggest cluster #1131
Cleaning for biggest cluster #1131
Conversation
ctapipe/image/cleaning.py
Outdated
@@ -355,3 +355,43 @@ def fact_image_cleaning( | |||
geom, pixels_to_keep, arrival_times, min_number_neighbors, time_limit | |||
) | |||
return pixels_to_keep | |||
|
|||
|
|||
def biggest_island(geom, image, mask): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would recommend to make this smaller. Give this function just the labels and return a new mask with fewer Trues
, thus this function gets smaller and is more general. Also num_islands
will be calculated twice if you also use it in the regular analysis.
So I would change this to:
def largest_island(labels):
return labels == np.argmax(np.bincount(labels[labels > 0])) # no need for where
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for this input.
Let me just understand better:
-
I don't understand the "fewer Trues" part, for me the only Trues are always those related to the largest island and the mask reflects always the original geometry,
-
I don't understand why number_of_islands, should be called twice in the regular analysis (do you mean protopipe?)
-
you suggest that in the regular analysis I call first number_of_islands, then largest_island and finally I select the geometry and image instead of doing everything inside biggest_island,
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- Yes, this was a bit confusing, I meant return a mask where only the pixels in the largest group are
True
- number_of_islands is a useful paramter for G/H separation, that should be stored into DL1.2, so if you call it inside largest_island without being able to get the number of islands, you need to call it again to be able to store the value.
- Yes, exactly. Small, composable functions!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- Yes, this was a bit confusing, I meant return a mask where only the pixels in the largest group are
True
This was already happening in my function, am I wrong? biggest_mask is a mask with the same dimensions of the original camera, but only the pixels belonging to the biggest island are set to True (biggest_mask = labels == biggest_label
).
- number_of_islands is a useful paramter for G/H separation, that should be stored into DL1.2, so if you call it inside largest_island without being able to get the number of islands, you need to call it again to be able to store the value.
OK, I understand now.
- Yes, exactly. Small, composable functions!
Yes, I agree.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't understand why number_of_islands, should be called twice in the regular analysis (do you mean protopipe?)
For example, we will include the number of islands in the DL1 parameters (so we should avoid computing it twice)
ctapipe/image/cleaning.py
Outdated
""" | ||
# Finds all islands from the cleaned pixels and label them | ||
num_islands, labels = number_of_islands(geom, mask) | ||
labels = labels.astype("int64") # this is temporaneus: I think it's better |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why? What is the dtype before?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
From my tests I could see that it was float64, so then when I used numpy.argmax, it would say that it was expecting int64 instead.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay, then I agree that the Change should be made in num_islands directly
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I checked, for me the dtype is int32, so no reason to change this. Where did you encounter floats?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I tested it in a separate jupyter notebook.
I placed the first 4 prints inside biggest_island, just after the call to number_of_islands, in particular, the last one is,
print(f"dtype of labels is {labels.dtype}")
I then remove the line in which I forcibly convert the labels to int64.
The output of the prints is the following:
There are 4 islands.
Island labels:
[1. 1. 0. 0. 2. 2. 2. 2. 0. 3. 3. 0. 0. 0. 4. 4. 4.]
dtype of labels is float64
Then I get the following error when trying to select the biggest islands' label,
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-10-51e0d6b9b8e0> in <module>
21
22 # Apply function to test
---> 23 biggest_cam, biggest_image = biggest_island(cam, image, clean_mask)
24
25 # Check if the function recovers the truth
<ipython-input-9-b8099df578f5> in biggest_island(geom, image, mask)
72
73 # Get the label of the biggest island and create a new mask
---> 74 biggest_label = np.argmax(np.bincount(labels[np.where(labels>0)]))
75 print(f"The biggest one is #{biggest_label}.")
76 biggest_mask = labels == (biggest_label)
<__array_function__ internals> in bincount(*args, **kwargs)
TypeError: Cannot cast array data from dtype('float64') to dtype('int64') according to the rule 'safe'
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The ones from the unit test:
# Create a simple camera
cam = CameraGeometry.make_rectangular(17, 1)
cam.cam_id = "test" # this overwrites the default -1 set by the function
# Create a simple image made of ones and zero "photoelectrons"
image = np.array([2, 1, 1, 1, 1, 2, 2, 1, 0, 2, 1, 1, 1, 0, 2, 2, 2])
# Let's say that we clean with a simple tailut with the following parameters
# p = 2, b = 1, min neighbors=0
# This cleaning would find the following islands
clean_mask = np.zeros(cam.n_pixels).astype("bool")
clean_mask[[0, 1]] = 1
clean_mask[[4, 5, 6, 7]] = 2 # this is the biggest
clean_mask[[9, 10]] = 3
clean_mask[[14, 15, 16]] = 4
Then I call the function and get the error,
biggest_cam, biggest_image = biggest_island(cam, image, clean_mask)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just to clear things up, the following is a minimal example:
cam = CameraGeometry.make_rectangular(5, 1)
image = np.array([0,0,1,1,1])
clean_mask = np.zeros(cam.n_pixels).astype("bool")
clean_mask[[2, 3, 4]] = 1 # the only and largest island
num_islands, labels = number_of_islands(cam, clean_mask)
print(f"There are {num_islands} islands.")
print(f"Islands labelled : {labels}")
print(f"dtype of labels is {labels.dtype}")
which returns the following output:
There are 1 islands.
Islands labelled : [0. 0. 1. 1. 1.]
dtype of labels is float64
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To be sure, I just replicated this result using only the conda version of ctapipe 0.7.
If it can be of some help, in this case when importing number_of_islands I get the following NumPy warning,
/Users/michele/Applications/anaconda3/envs/protopipe/lib/python3.7/site-packages/corsikaio/subblocks/dtypes.py:20: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
return np.dtype(dict(**dt))
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Found it. Can you add, dtype='int32' to the call to np.zeros in num islands?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, in this way the problem is solved at the very beginning.
Codecov Report
@@ Coverage Diff @@
## master #1131 +/- ##
==========================================
+ Coverage 86.51% 86.54% +0.03%
==========================================
Files 185 185
Lines 11627 11655 +28
==========================================
+ Hits 10059 10087 +28
Misses 1568 1568
Continue to review full report at Codecov.
|
…corrections to number_of_islands docstring.
…HealthyPear/ctapipe into feature-biggest_island_cleaning
…surviving the cleaning.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
I had to go on Urban Dictionary....I feel so old. |
I don't think that's an "age thing". More a "used to code review" thing |
In an effort to minimize the use of external libraries in protopipe (see issues cta-observatory/protopipe#5 and cta-observatory/protopipe#6), I implemented a function to clean just for the biggest cluster in an image.
I also implemented the associated unit test, which returned a positive result during my local tests.
In the definition of the function, there is a numpy array conversion which I propose to move into the recently added function number_of_islands.