
QA data loader for alembic testing #3273

Merged
merged 11 commits into from
May 24, 2018

Conversation


@heartsucker commented Apr 16, 2018

Status

Ready for review

Description of Changes

Toward #3244

Adds a master data loader that dumps data into the database / storage dir to allow us to test migrations.

Testing

Spin up a staging machine and run ``./qa_loader.py``.

Checklist

If you made changes to the server application code:

  • Linting (make ci-lint) and tests (make -C securedrop test) pass in the development container

@heartsucker requested a review from conorsch as a code owner April 16, 2018 15:25
@heartsucker requested a review from a user April 16, 2018 15:25
@heartsucker (Contributor, Author)

Note that this doesn't use factory_boy as suggested earlier because when I did it that way, it was honestly far more obtuse and harder to understand than a few hand-rolled functions.

@codecov-io commented Apr 16, 2018

Codecov Report

Merging #3273 into alembic will increase coverage by 0.17%.
The diff coverage is n/a.


@@             Coverage Diff             @@
##           alembic    #3273      +/-   ##
===========================================
+ Coverage    85.14%   85.32%   +0.17%     
===========================================
  Files           36       37       +1     
  Lines         2201     2330     +129     
  Branches       239      257      +18     
===========================================
+ Hits          1874     1988     +114     
- Misses         269      283      +14     
- Partials        58       59       +1
Impacted Files Coverage Δ
qa_loader.py 88.37% <0%> (ø)

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 5c04c22...136df17.

@redshiftzero (Contributor) left a comment:

Hey @heartsucker - I did some testing and investigating in staging VMs and am dropping a few comments inline. Note that I haven't yet been able to get a clean upload of data, but I'll return to this after a round of 0.7.0 QA and resume where I left off.


def new_journalist():
    # Make a diceware-like password
    pw = ' '.join([random_chars(3, nullable=False) for _ in range(7)])
Contributor:

So, since the random characters here can be \n or \t, we can get something like:

password = '>e, \n\n\t ]ec olq <um v~_ xl)'

which has len(password.split()) < 7 and thus will raise a NonDicewarePassword exception.

Instead, you could pick from this list of characters such that no NonDicewarePassword exception is raised
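The fix suggested above can be sketched like this; `SAFE_CHARS` and `diceware_like_password` are illustrative stand-ins, not the actual helpers in qa_loader.py:

```python
import random
import string

# Printable characters with no whitespace, so ' '.join(...) can never
# produce a password whose split() yields fewer than 7 words.
SAFE_CHARS = string.ascii_letters + string.digits + string.punctuation

def diceware_like_password(words=7, word_len=3):
    """Build a diceware-like password of space-separated pseudo-words."""
    return ' '.join(
        ''.join(random.choice(SAFE_CHARS) for _ in range(word_len))
        for _ in range(words)
    )

pw = diceware_like_password()
```

Because the character pool contains no whitespace, `len(pw.split()) == 7` always holds, so the NonDicewarePassword exception cannot be raised.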



def new_source():
    fid_len = random.randint(0, 32)
Contributor:

If we get a fid_len of 0, we will start getting IntegrityErrors since we have a unique constraint on filesystem_id

Contributor:

Note that we also want to avoid very small non-zero fid_len values, since there is a high probability of hitting IntegrityError for them too (I can attest that with this random seed, fid_len = random.randint(4, 32) worked with no problems).

Contributor (Author):

I included empty FIDs because that's allowed by the DB and could (but very, very likely won't) cause problems during future migrations.
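If we want to keep very short (even empty) filesystem IDs in the test data while avoiding the IntegrityErrors discussed above, one option is to retry on collision. A minimal sketch, with the database's unique constraint approximated by an in-memory set (all names here are illustrative, not qa_loader.py code):

```python
import random
import string

_seen_fids = set()  # stand-in for the DB's unique constraint on filesystem_id

def random_fid():
    """Draw a filesystem ID of length 0..32, empty string included."""
    fid_len = random.randint(0, 32)
    return ''.join(random.choice(string.ascii_lowercase) for _ in range(fid_len))

def unique_fid(max_tries=100):
    """Retry until an unused ID is found; short IDs collide often."""
    for _ in range(max_tries):
        fid = random_fid()
        if fid not in _seen_fids:
            _seen_fids.add(fid)
            return fid
    raise RuntimeError('could not find an unused filesystem_id')
```

This keeps the edge case (an empty FID can still occur exactly once) without tripping the unique constraint on repeated loads.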

    description='Loads data into the database for testing upgrades')
parser.add_argument('-m', '--multiplier', type=positive_int, default=100,
                    help=('Factor to multiply the loaded data by '
                          '(default 100)'))
Contributor:

sweet!

cleanly in production-like instances, we have a helper script that is designed to load
semi-randomized data into the database. You will need to modify the script ``qa_loader.py`` to
include sample data. This sample data should intentionally include edge cases that might behave
strangely, such as data whose nullability is only enforced by the application, or missing files.
Contributor:

For release managers and release testers, let's add a bit of docs saying:

  1. Provision staging or prod VMs.
  2. ``sudo su``
  3. ``cd /var/www/securedrop``
  4. ``./qa_loader.py``


filename = random_chars(20, nullable=False, chars=string.ascii_lowercase)
filename = '1-' + filename + '-msg.gpg'
f_len = random.randint(1, 1024 * 1024 * 500)
Contributor:

I see why you're doing this, but sadly this produces a bunch of very large files and my test VMs ran out of space very rapidly 😇.

What about reducing the number of files that have very large sizes, e.g. by drawing from a steep exponential distribution: int(math.floor(random.expovariate(10.0) * 1024 * 1024 * 500))? We'll get mostly smaller files but (with low probability) a few bigger ones thrown in. This should significantly reduce the disk space needed and still be a realistic test (note: I haven't tested this yet and it likely needs tweaking).
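The suggested distribution is easy to sanity-check: `random.expovariate(10.0)` has mean 1/10, so the average generated file would be around 0.1 × 500 MB = 50 MB, with a long but thin upper tail. A small sketch of the idea (untested against the actual loader, as noted above):

```python
import math
import random

def random_file_size():
    # expovariate(10.0) has mean 0.1, so sizes average roughly 50 MB;
    # most draws are much smaller, with rare large outliers.
    return int(math.floor(random.expovariate(10.0) * 1024 * 1024 * 500))

sizes = [random_file_size() for _ in range(1000)]
```

Compared to `random.randint(1, 1024 * 1024 * 500)` (mean ~250 MB per file), this cuts the expected total disk usage by roughly a factor of five while still exercising large-file paths occasionally.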

@heartsucker (Contributor, Author)

@redshiftzero Can you review this locally? I'm getting merge conflicts with develop during CI, and I'd rather just deal with that rebase once on the alembic branch.

@redshiftzero self-assigned this May 21, 2018
@redshiftzero (Contributor) left a comment:

I was able to run through this successfully in staging VMs on top of the commits I added, so this is shaping up nicely. Details of my changes are in the commit messages; let me know if you disagree with any of them. The premise of my modifications is that a person doing QA should be able both to smoothly run the QA loader and to click around the journalist interface without seeing any 500s. If you haven't done that with this QA data, you definitely should before we merge, as there are a couple of other things that may seem odd to testers, e.g. the random journalist designations; take a look and let me know what you think. It's important we resolve this now, because otherwise we'll have to resolve it during the QA period if something is confusing.

There's one last case to handle (#1189); there is a comment inline about that. Fortunately, I think there is a nice path forward there 😄

3. Provision staging VMs
4. ``vagrant ssh app-staging``
5. ``sudo su``
6. ``cd /var/www/securedrop && ./qa-loader.py``
Contributor:

Nit: ./qa-loader.py -> ./qa_loader.py

                 SOURCE_COUNT * multiplier + multiplier):
    new_abandoned_submission(config, sid)
# TODO add submissions without sources for #1189
Contributor:

I think we do want this - new_abandoned_submission almost handles this situation, it looks like we just need to delete the source(s) that the abandoned submissions are associated with. Using raw SQL to do the deletion should do the trick since the submission cascade deletion is enforced at the ORM level.
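A minimal sqlite3 sketch of why raw SQL sidesteps the ORM-level cascade; the schema here is a simplified assumption, not the real SecureDrop schema. Deleting the source row with a plain SQL statement means no ORM session is involved, so the Python-side cascade never fires and the submission rows are left dangling (note that SQLite does not enforce foreign keys unless the pragma is enabled):

```python
import sqlite3

conn = sqlite3.connect(':memory:')
conn.executescript('''
    CREATE TABLE sources (id INTEGER PRIMARY KEY);
    CREATE TABLE submissions (
        id INTEGER PRIMARY KEY,
        source_id INTEGER REFERENCES sources(id)
    );
    INSERT INTO sources (id) VALUES (1);
    INSERT INTO submissions (id, source_id) VALUES (1, 1), (2, 1);
''')

# Raw SQL delete: the ORM cascade that would remove the submissions
# never runs, so both submission rows survive with a stale source_id.
conn.execute('DELETE FROM sources WHERE id = 1')
dangling = conn.execute(
    'SELECT COUNT(*) FROM submissions WHERE source_id = 1').fetchone()[0]
```

This is exactly the edge case #1189 is about: rows the application can no longer reach through their parent, which a migration still has to handle.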

heartsucker and others added 5 commits May 23, 2018 16:01
To do so, they must have pending set to False (by default it is
True, and such sources will not appear in the journalist interface),
as well as a filesystem_id that is not None. Any source that has
a filesystem_id of None will cause a 500 server error, so this is
a reasonable assumption to make for existing instances.
Since we generate a fake file for replies and submissions, we
should limit the number of replies generated per iteration to 3
(previously it was 100). In addition, we need to use a very steep
file size distribution, or the total size of the files generated
is unwieldy (i.e. it fills up the staging VM).
@redshiftzero (Contributor) left a comment:

Latest changes look good! I added one commit (ca0ffb5) to simplify the dangling submissions case a bit. I verified this dangling submissions case via models.db.session.query(models.Submission).filter_by(source_id=None).all() to see the submissions are present as expected.

Approved, and I'll let you do the merge honors here since the feature branch is not protected and I made the last change.

@heartsucker (Contributor, Author)

I just noticed in the UI that you can't add emoji/reactions to a review, only comments. So.

👍👍👍👍👍👍👍👍👍👍👍👍👍👍👍

@heartsucker merged commit 5689235 into alembic May 24, 2018
@heartsucker deleted the qa-data-loader branch May 24, 2018 07:17