Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Warn team about high failure rates #2667

Merged
merged 8 commits into from
Dec 9, 2019

Conversation

CrystalPea
Copy link
Contributor

@CrystalPea CrystalPea commented Dec 4, 2019

Log an exception and send a zendesk ticket when, in the last 24 hours:

  • there were services with failure rate over 25% for their sms with at least 100 messages sent
  • there were services that sent at least 100 sms messages to tv numbers

Story ticket: https://www.pivotaltracker.com/n/projects/1443052

@CrystalPea CrystalPea force-pushed the warn-team-about-high-failure-rates branch from 678cb48 to 180915d Compare December 4, 2019 14:04
Copy link
Contributor

@servingUpAces servingUpAces left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a couple small comments

Copy link
Contributor

@idavidmcdonald idavidmcdonald left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I haven't looked at the code but I might suggest adding a runbook entry to https://github.com/alphagov/notifications-manuals/wiki/Support-Runbook telling what a developer should do when they receive this ticket and link to the runbook play from the zendesk ticket

current_app.logger.exception(message)

if current_app.config['NOTIFY_ENVIRONMENT'] in ['live', 'production', 'test']:
message += "\nThings to do: contact service? revoke their key?"
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Explain what TV numbers are:
These are numbers reserved for the use in TV and drama. The range of TV numbers are 07700 900000 to 900999. When an SMS notification is sent to such a number, we get a 'permanent-failure' status back and we are billed for this notification.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks. That first sentence would be useful to put in the TODO section or in the runbook so a developer receiving the ticket knows how to deal with it

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah, I wrote it to myself here only to remember about including it :). As it only transpired in a spoken conversation.

len(services_sending_to_tv_numbers)
)
for service in services_sending_to_tv_numbers:
message += "service id: {}, count of sms to tv numbers: {},\n".format(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would add the url for service id https://www.notifications.service.gov.uk/services/{service_id}

current_app.logger.exception(message)

if current_app.config['NOTIFY_ENVIRONMENT'] in ['live', 'production', 'test']:
message += "\nThings to do: contact service? revoke their key?"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To start with we should investigate by looking at the dashboard, then contact the service about the large failure rate or use of tv numbers.

app/dao/services_dao.py Show resolved Hide resolved
@CrystalPea CrystalPea force-pushed the warn-team-about-high-failure-rates branch from 538be7d to 635c9cf Compare December 5, 2019 17:35
@CrystalPea
Copy link
Contributor Author

@servingUpAces I refactored this code as per our conversations now, what do you think?

@idavidmcdonald I created a section for this ticket in the Manual and linked it from the ticket.

@idavidmcdonald
Copy link
Contributor

@CrystalPea Excellent runbook entry :)

@@ -1099,3 +1071,79 @@ def create_email_sms_letter_template():
template_two = create_template(service=service, template_name='2', template_type='sms')
template_three = create_template(service=service, template_name='3', template_type='letter')
return template_one, template_three, template_two


@freeze_time("2019-12-02 12:00:00.000000")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need to freeze_time for this test?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we do, to show that most notifications are in good date range, while notifications for service_6 are just a bit too old and do not make it to results.


@freeze_time("2019-12-02 12:00:00.000000")
def test_dao_find_services_sending_to_tv_numbers(notify_db_session, fake_uuid):
service_1 = create_service(service_name="Service 1", service_id=fake_uuid)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need to use fake_uuid?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need an uuid, as we assert on it below, to check the result we get is for the right service.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can just use the id from service_1.id

tests/app/dao/test_services_dao.py Show resolved Hide resolved
@CrystalPea CrystalPea force-pushed the warn-team-about-high-failure-rates branch from f42f20e to 846abca Compare December 6, 2019 16:39
@CrystalPea CrystalPea force-pushed the warn-team-about-high-failure-rates branch from 846abca to 08b12a6 Compare December 6, 2019 17:05
@CrystalPea CrystalPea merged commit 2019070 into master Dec 9, 2019
@CrystalPea CrystalPea deleted the warn-team-about-high-failure-rates branch December 9, 2019 11:28
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants