Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

C11 atomic lock alignment in data_t #685

Merged
merged 2 commits into from
Oct 24, 2024

Conversation

abouteiller
Copy link
Contributor

c11 atomic_flag can be shorter (bool) than the volatile int type used in atomic_external.h, leading to misaligned static structures (like data_t) so, when C11 atomics are enabled, we force the size of atomic_lock_t to be at least one int long, regardless, to match the expected alignment.

@abouteiller abouteiller added bug Something isn't working blocker Blocking release or critical use case labels Oct 24, 2024
@abouteiller abouteiller self-assigned this Oct 24, 2024
@@ -183,37 +183,42 @@ __int128_t parsec_atomic_fetch_add_int128(volatile __int128_t* l, __int128_t v)
#endif

/* Locks */

typedef volatile atomic_flag parsec_atomic_lock_t;
typedef union {
Copy link
Contributor

@bosilca bosilca Oct 24, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is ugly. What if you remove the volatile completely ? Or move it outside the union as in

typedef volatile union parsec_atomic_lock_u {
    atomic_flag flag;
    int atomic_external_aligner;
} parsec_atomic_lock_t;

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the volatile is not useful I removed it and it passes all tests

@bosilca
Copy link
Contributor

bosilca commented Oct 24, 2024

I'm really puzzled by this. That a compiler can decide to implement the atomic_flag_t as bool or int, the alignment of the encompassing struct should not change. Moveover, as long as the update/check operations are always done with the same compiler, everything should just work. We always update such variables via accessors defined in parsec library, so this should never be an issue. So, what exactly is the problem here ?

@abouteiller
Copy link
Contributor Author

the next field in data_t is a uint8_t so it packs tightly with the bool, then the whole rest of the data_t structure is not aligned in memory in the same way when looked at from a file that includes atomic-external.h versus atomic-c11.h. That ends up causing crashes when we dereference fields in the data_t, and obtain garbage from fields with a differing offsetof. That can be seen for example when we declare data_t in the lapack reshaping functions in dplasma (they are in atomic-external packing shape), and then read the values of the data_t in parsec (then in atomic-c11 packed shape). This has been observed and confirmed in gdb, and the patch fixes this issue.

@bosilca
Copy link
Contributor

bosilca commented Oct 24, 2024

According to the code in data_internal.h the next field in the parsec_data_t is parsec_data_key_t aka uint64_t, aligned to 8.

@therault
Copy link
Contributor

Interesting / facetious :) We saw that behavior while debugging the PR #671 where indeed the parsec_data_key_t field is removed from the definition of parsec_data_t, making this bug manifest (https://github.com/ICLDisco/parsec/pull/671/files#diff-dbde8517930fb2ca1ac0555fc9e18cc92a7a282a97a2e20fa222c7fc1d73fd16) because then the next field is a int8_t.

IMHO, it is fragile to rely on the alignment of the next field in some structure to guarantee the types with atomic-external.h and without are compatible. I prefer the (admittedly not very aesthetic) solution of forcing the base type to be compatible with what we expose in atomic-external.h

atomic_external.h, leading to misaligned static structures (like data_t)
so, when C11 atomics are enabled, we force the size of atomic_lock_t to
be at least one int long, regardless, to match the expected alignment.
@abouteiller abouteiller force-pushed the bugfix/c11-atomic-lock branch from 90a03ce to d567b31 Compare October 24, 2024 21:21
@abouteiller abouteiller marked this pull request as ready for review October 24, 2024 21:24
@abouteiller abouteiller requested a review from a team as a code owner October 24, 2024 21:24
@bosilca bosilca merged commit 46d8d93 into ICLDisco:master Oct 24, 2024
8 checks passed
@abouteiller abouteiller deleted the bugfix/c11-atomic-lock branch October 24, 2024 21:52
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
blocker Blocking release or critical use case bug Something isn't working
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants