Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Cleaning for biggest cluster #1131

Merged

Conversation

HealthyPear
Copy link
Member

In an effort to minimize the use of external libraries in protopipe (see issues cta-observatory/protopipe#5 and cta-observatory/protopipe#6), I implemented a function to clean just for the biggest cluster in an image.
I also implemented the associated unit test, which returned a positive result during my local tests.

In the definition of the function, there is a numpy array conversion which I propose to move into the recently added function number_of_islands.

@@ -355,3 +355,43 @@ def fact_image_cleaning(
geom, pixels_to_keep, arrival_times, min_number_neighbors, time_limit
)
return pixels_to_keep


def biggest_island(geom, image, mask):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would recommend to make this smaller. Give this function just the labels and return a new mask with fewer Trues, thus this function gets smaller and is more general. Also num_islands will be calculated twice if you also use it in the regular analysis.

So I would change this to:

def largest_island(labels):
    return labels == np.argmax(np.bincount(labels[labels > 0]))  # no need for where

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for this input.
Let me just understand better:

  • I don't understand the "fewer Trues" part, for me the only Trues are always those related to the largest island and the mask reflects always the original geometry,

  • I don't understand why number_of_islands, should be called twice in the regular analysis (do you mean protopipe?)

  • you suggest that in the regular analysis I call first number_of_islands, then largest_island and finally I select the geometry and image instead of doing everything inside biggest_island,

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • Yes, this was a bit confusing, I meant return a mask where only the pixels in the largest group are True
  • number_of_islands is a useful paramter for G/H separation, that should be stored into DL1.2, so if you call it inside largest_island without being able to get the number of islands, you need to call it again to be able to store the value.
  • Yes, exactly. Small, composable functions!

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • Yes, this was a bit confusing, I meant return a mask where only the pixels in the largest group are True

This was already happening in my function, am I wrong? biggest_mask is a mask with the same dimensions of the original camera, but only the pixels belonging to the biggest island are set to True (biggest_mask = labels == biggest_label).

  • number_of_islands is a useful paramter for G/H separation, that should be stored into DL1.2, so if you call it inside largest_island without being able to get the number of islands, you need to call it again to be able to store the value.

OK, I understand now.

  • Yes, exactly. Small, composable functions!

Yes, I agree.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand why number_of_islands, should be called twice in the regular analysis (do you mean protopipe?)

For example, we will include the number of islands in the DL1 parameters (so we should avoid computing it twice)

"""
# Finds all islands from the cleaned pixels and label them
num_islands, labels = number_of_islands(geom, mask)
labels = labels.astype("int64") # this is temporaneus: I think it's better
Copy link
Member

@maxnoe maxnoe Oct 14, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why? What is the dtype before?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From my tests I could see that it was float64, so then when I used numpy.argmax, it would say that it was expecting int64 instead.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, then I agree that the Change should be made in num_islands directly

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I checked, for me the dtype is int32, so no reason to change this. Where did you encounter floats?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tested it in a separate jupyter notebook.
I placed the first 4 prints inside biggest_island, just after the call to number_of_islands, in particular, the last one is,

print(f"dtype of labels is {labels.dtype}")

I then remove the line in which I forcibly convert the labels to int64.
The output of the prints is the following:

There are 4 islands.
Island labels:
[1. 1. 0. 0. 2. 2. 2. 2. 0. 3. 3. 0. 0. 0. 4. 4. 4.]
dtype of labels is float64

Then I get the following error when trying to select the biggest islands' label,

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-10-51e0d6b9b8e0> in <module>
     21 
     22 # Apply function to test
---> 23 biggest_cam, biggest_image = biggest_island(cam, image, clean_mask)
     24 
     25 # Check if the function recovers the truth

<ipython-input-9-b8099df578f5> in biggest_island(geom, image, mask)
     72 
     73     # Get the label of the biggest island and create a new mask
---> 74     biggest_label = np.argmax(np.bincount(labels[np.where(labels>0)]))
     75     print(f"The biggest one is #{biggest_label}.")
     76     biggest_mask = labels == (biggest_label)

<__array_function__ internals> in bincount(*args, **kwargs)

TypeError: Cannot cast array data from dtype('float64') to dtype('int64') according to the rule 'safe'

Copy link
Member Author

@HealthyPear HealthyPear Oct 14, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The ones from the unit test:

    # Create a simple camera
    cam = CameraGeometry.make_rectangular(17, 1)
    cam.cam_id = "test"  # this overwrites the default -1 set by the function
    # Create a simple image made of ones and zero "photoelectrons"
    image = np.array([2, 1, 1, 1, 1, 2, 2, 1, 0, 2, 1, 1, 1, 0, 2, 2, 2])
    # Let's say that we clean with a simple tailut with the following parameters
    # p = 2, b = 1, min neighbors=0
    # This cleaning would find the following islands
    clean_mask = np.zeros(cam.n_pixels).astype("bool")
    clean_mask[[0, 1]] = 1
    clean_mask[[4, 5, 6, 7]] = 2  # this is the biggest
    clean_mask[[9, 10]] = 3
    clean_mask[[14, 15, 16]] = 4

Then I call the function and get the error,

biggest_cam, biggest_image = biggest_island(cam, image, clean_mask)

Copy link
Member Author

@HealthyPear HealthyPear Oct 14, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just to clear things up, the following is a minimal example:

cam = CameraGeometry.make_rectangular(5, 1)
image = np.array([0,0,1,1,1])
clean_mask = np.zeros(cam.n_pixels).astype("bool")
clean_mask[[2, 3, 4]] = 1 # the only and largest island
num_islands, labels = number_of_islands(cam, clean_mask)
print(f"There are {num_islands} islands.")
print(f"Islands labelled : {labels}")
print(f"dtype of labels is {labels.dtype}")

which returns the following output:

There are 1 islands.
Islands labelled : [0. 0. 1. 1. 1.]
dtype of labels is float64

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To be sure, I just replicated this result using only the conda version of ctapipe 0.7.

If it can be of some help, in this case when importing number_of_islands I get the following NumPy warning,

/Users/michele/Applications/anaconda3/envs/protopipe/lib/python3.7/site-packages/corsikaio/subblocks/dtypes.py:20: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  return np.dtype(dict(**dt))

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Found it. Can you add, dtype='int32' to the call to np.zeros in num islands?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, in this way the problem is solved at the very beginning.

@codecov
Copy link

codecov bot commented Oct 14, 2019

Codecov Report

Merging #1131 into master will increase coverage by 0.03%.
The diff coverage is 100.00%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #1131      +/-   ##
==========================================
+ Coverage   86.51%   86.54%   +0.03%     
==========================================
  Files         185      185              
  Lines       11627    11655      +28     
==========================================
+ Hits        10059    10087      +28     
  Misses       1568     1568              
Impacted Files Coverage Δ
ctapipe/image/cleaning.py 96.15% <100.00%> (+0.20%) ⬆️
ctapipe/image/tests/test_cleaning.py 100.00% <100.00%> (ø)

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 86cec54...ad3990e. Read the comment docs.

kosack
kosack previously approved these changes Jan 30, 2020
@HealthyPear HealthyPear requested a review from maxnoe March 25, 2020 10:14
Copy link
Member

@maxnoe maxnoe left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@HealthyPear
Copy link
Member Author

LGTM

I had to go on Urban Dictionary....I feel so old.

@maxnoe
Copy link
Member

maxnoe commented Mar 25, 2020

LGTM
I had to go on Urban Dictionary....I feel so old.

I don't think that's an "age thing". More a "used to code review" thing

@kosack kosack merged commit 5aea005 into cta-observatory:master Mar 25, 2020
@HealthyPear HealthyPear deleted the feature-biggest_island_cleaning branch April 9, 2021 09:11
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants