-
Notifications
You must be signed in to change notification settings - Fork 11
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Enhance GSON Converters, hash data only once, cache ULID generator class #137
base: main
Are you sure you want to change the base?
Enhance GSON Converters, hash data only once, cache ULID generator class #137
Conversation
By the way, a lot of the diff is caused by reformat (i'm an addict to ctrl+alt+l in IJ!), what do you think of opening a pull request to setup the maven formatter plugin ? |
6e5f534
to
bc8e871
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @Override-6, thanks for the suggestions. I have split the review into two parts, taking on the high-level approach you've suggested which I'll discuss here, and some code-specific comments which I'll write down below.
Hashing in the SDK top-level calls
While I think that the hashing of a piece of data should be the role of the Annotator
since it is a field on the Annotation
being created, I can't deny that the performance gained from this change is significant enough for us to overlook this design choice. I think we should give this a greenlight.
Additionally, I would have hoped that we can replace the data
parameter in the annotator execute method to be the hash
parameter, but as you already probably know the PkiAnnotator
relies on the data, and is also the only annotator that does so. I think we should keep this in mind as it has been discussed that we want to remove the reliance of that annotator on the incoming data schema... just pinning a thought here.
Base64 Encoding Annotations in Publish Wrappers
I believe it's been discussed that breaking changes should not be entertained at this time. Other Alvarium services expect that the publish wrapper will contain the annotation list as a Base64-encoded value. This change should probably be reverted.
ULID Generation
I think this is on the right track, I am not a fan of static variables though. Do you think the same idea you have here can be approached differently ? I was thinking something along the lines of a ULID provider class, similar to how hashing and signing works. Interested to hear your ideas here.
Annotation Serialization
I think this is a straight up improvement on what was already there, just some implementation comments below.
import java.util.ArrayList; | ||
import java.util.HashMap; | ||
import java.util.List; | ||
|
||
import com.alvarium.annotators.Annotator; | ||
import com.alvarium.annotators.AnnotatorConfig; | ||
import com.alvarium.annotators.AnnotatorException; | ||
import com.alvarium.annotators.AnnotatorFactory; | ||
import com.alvarium.contracts.Annotation; | ||
import com.alvarium.contracts.AnnotationList; | ||
import com.alvarium.contracts.AnnotationType; | ||
import com.alvarium.hash.HashProviderFactory; | ||
import com.alvarium.hash.HashTypeException; | ||
import com.alvarium.streams.StreamException; | ||
import com.alvarium.streams.StreamProvider; | ||
import com.alvarium.streams.StreamProviderFactory; | ||
import com.alvarium.utils.ImmutablePropertyBag; | ||
import com.alvarium.utils.PropertyBag; | ||
|
||
import org.apache.logging.log4j.Logger; | ||
|
||
import java.util.ArrayList; | ||
import java.util.HashMap; | ||
import java.util.List; | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please refer to our style guide which contains a section for imports and the correct ordering and grouping. It's a combination of industry styling guides used by Google and Twitter.
I understand these changes are probably due to a formatter, but what maintainers here usually do is create configuration on the editor of choice prior to making changes that reflect what is in that style guide.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No problem, why not trying to use a maven plugin to help apply the code style policy ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't see why not, we can look into it.
src/main/java/com/alvarium/serializers/AlvariumPersistence.java
Outdated
Show resolved
Hide resolved
public class AlvariumPersistence { | ||
|
||
|
||
public static final Gson GSON = new GsonBuilder() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
From what you've seen in your tests, does just instantiating this object each time a serialization is made take a lot away from performance ? I can understand the reasoning behind making this static since it's not really a changing value, but if it doesn't affect performance I don't see why not instantiate a new AlvariumPersistence
each time we need to serialize something.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It doesnt that much, I could create a getGson()
but a Gson object is not something we gonna mutate, so why not place it in a static place and access it.
I made the GSON instance private, and now expose three static functions from the classs, so i think placing it in a static field is even more adapted now :
public class PeristenceFunctions {
private static final Gson GSON = ...;
public static String serializeWrapper(PublishWrapper wrapper) {
return GSON.toJson(wrapper);
}
public static String serializeAnnotation(Annotation annotation) {
return GSON.toJson(annotation);
}
public static Annotation deserializeAnnotation(String jsonAnnotation) {
return GSON.fromJson(jsonAnnotation, Annotation.class);
}
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well the reason behind my suggestion was that we could just do something like this:
public class JSONSerializer<T> {
private final Gson gson;
public JSONSerializer() {
gson = new GsonBuilder()
.registerTypeAdapter(PublishWrapper.class, new PublishWrapperConverter())
.registerTypeAdapter(ZonedDateTime.class, new ZonedDateTimeConverter())
.registerTypeAdapter(Annotation.class, new AnnotationConverter())
.disableHtmlEscaping()
.create();
}
public String toJson(T serializable) {
return this.gson.toJson(serializable);
}
public T fromJson(String json, java.lang.Class<T> classOfT) { // maybe there is a way to skip the second parameter, I haven't dug deep enough to check
return this.gson.fromJson(json, classOfT);
}
}
and to use it we can just do new JSONSerializer<Annotation>().toJson(annotation)
or new JSONSerializer<Annotation>().fromJson(json, Annotation.class)
. No static instances of anything, and I assume no detriment to performance.
Thoughts ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see two issues here :
-
The Gson is configured to be only used by Alvarium (due to the ZonedDateTime adapter that is specific to our use.), i don't see why we want to use a generic method here, as we will only need to support persistance operations for Annotation and PublishWrapper.
I think that constraining the JSONSerializer with more typed method (serializeWrapper
,serializeAnnotation
etc) helps to encapsulate the Gson, and so we ensure that we only use the serializer for what it is intended for (as we saw that using it for unknown structures leads to serious performances degradation). -
new JSONSerializer().toJson()
is semantically the same asJSONSerializer.toJson()
, as there is no possible way to mutate the wrapped Gson instance.
A Gson instance is thread safe for both serialization and deserialization.
I really dont see any advantage (in both semantic and runtime) in not using a static constant
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That does make sense. I like the idea of enforcing the use of the serializer to only specific types in order to avoid performance issues, so I am sold on that part. But you do realize this will require a fair bit of coupling between the serializers package and any serializable class... Do you think it is possible to move the serialization logic as a member of the respective class ? i.e., something on the lines of String json = publishWrapper.toJson()
while keeping all performance improvements ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry i don't understand, do you want to move the functions of the PersistanceFunctions.serializeX
to X.toJson()
? If yes then the methods already exists in PublishWrapper
and Annotation
.
Or do you want to have a GSON instance in PublishWrapper and Annotation and then those gson are specialized to the class they are associated ? If yes then this would be like it is done currently, and this will force you to ensure that the changes you make are set in all Gsons (if you add an adpater then you'll need to not forget to place it in all Gson) this looks like code duplication i think 🤔
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Or do you want to have a GSON instance in PublishWrapper and Annotation and then those gson are specialized to the they are associated ? If yes then would be like it is done currently, and this will force you to ensure that the changes you make are set in all Gsons (if you add an adpater then you'll need to not forget to place it in all Gson) this looks like code duplication i think 🤔
Yes, this is what I meant. I see the implication of code duplication here, and yes it would be a worse approach. I was checking to see if you have any ideas that can improve on the coupling across packages without any downsides, as I have none, but it seems this really is the best approach we have on hand at the moment.
My only gripe at the moment is with the class name still 😄 "Persistence" is confusing in a software development space, which is usually used for long-term persistence such as DBs and storage, and when someone reads PersistenceFunctions.serializeWrapper(wrapper)
, they might think the output is actually being stored somewhere, somehow. Just a suggestion, how about Serialization.toJson(wrapper)
which tells me that only serialization is happening as well as using which format. This makes extensibility to other formats easier as well.
public class Serialization {
private static final Gson GSON = new GsonBuilder()
.registerTypeAdapter(PublishWrapper.class, new PublishWrapperConverter())
.registerTypeAdapter(ZonedDateTime.class, new ZonedDateTimeConverter())
.registerTypeAdapter(Annotation.class, new AnnotationConverter())
.disableHtmlEscaping()
.create();
public static String toJson(PublishWrapper wrapper) {
return GSON.toJson(wrapper);
}
public static String toJson(Annotation annotation) {
return GSON.toJson(annotation);
}
public static Annotation annotationFromJson(String jsonAnnotation) {
return GSON.fromJson(jsonAnnotation, Annotation.class);
}
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah sure no problem !
src/main/java/com/alvarium/serializers/AlvariumPersistence.java
Outdated
Show resolved
Hide resolved
src/main/java/com/alvarium/serializers/AnnotationConverter.java
Outdated
Show resolved
Hide resolved
Hey @Ali-Amin, Thanks for the review. Hashing in the SDK top-level callsFor me the All those things makes the implementation bother with additional operations it shouldnt, and this is why you had to make the hashing process for each annotation, or to inject the I think the responsibility to create ULID GenerationWhy not simply using FormatI think we could open a PR to set up a code formatter and configure it to apply the code style convention? |
4f31b02
to
9167729
Compare
9167729
to
b5374cb
Compare
Why not removing the In this scala sdk poc, I wanted to use a more typesafe way to inject things at "annotation evaluation time", maybe we could use it here, which could allow the PkiAnnotator to stay in the sdk! |
|
I think that might be a good approach yes, because the data is needed for the annotation's 'satisfaction' logic, and we usually pass info related to that 'satisfaction' logic in the property bag for other annotators. So it makes sense to me to do the same here. |
I understand for ULID, i forgot that changing the ID generation algorithm would also be a breaking change ! For simplicity's sake, i'm personally more in favor of keeping the ULID in a static field (I indeed could create a I think i'll open a new PR to setup a formatter
Do you want me to implement this idea here or do i create another PR for this (once this one will be merged) ? I am interested by what you mean by Envelope, do you have any thing i can read to better understand the idea ? Thanks you ! |
I think the advantages of an ID generator outweigh those of a static instantiation of ULID, having a default value for the ID generator config as well might be a good idea to prevent headache for other devs.
Awesome, looking forward to it.
I think this PR is large enough, lets keep any more changes and suggestions into their own single units to keep reviews simple and fast.
I do not have anything on hand unfortunately, I will try to fetch the TSC call recording link we had where we discussed plans for it. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Most of everything else looks good to me. As soon as this is fixed, I'll plug these changes into a working Jenkins pipeline and ensure scores are being calculated without error.
This recording https://wiki.lfedge.org/download/attachments/68584065/Alvarium-TSC-11-Mar-2024.mp4?api=v2 should have the relevant info, the discussion starts around 9:50. |
83c50cd
to
2859b54
Compare
2859b54
to
01cd65f
Compare
@Ali-Amin @Override-6 Thanks for the collaboration on this. Please have some patience w/r/t the timing of getting this work approved and merged. We have an effort in front of it that is proving difficult and I'd like to get that resolved before considering these changes. There is also some pending vacation time coming up which will add a delay. Just FYI that we're still tracking this PR. |
Hello,
This Pull Request implements the three non-breaking changes ideas as discussed in the report mentioned in #136 (section 4.1, page 6).
What has been improved
JSON Optimisation
The main improvement is by optimising GSON serialization.
In the report, I say that 25% of a data annotation process goes to serializing the annotations to JSON.
Avoid GSON reflexion serializer
The profiling results shows that the main part of this overhead is that Gson has to use the java reflection library to automatically serialize the annotations, as no
JsonSerializer
is properly registered in theGSON
context.Currently, the GSON library seems to be misused :
alvarium-sdk-java/src/main/java/com/alvarium/serializers/AnnotationConverter.java
Lines 28 to 37 in 7e53f04
By doing this, the
Annotation
is converted to a json string (usingAnnotation#toJson
), which is then parsed back to aJSONElement
.The implementation of
Annotation#toJson
is the following :alvarium-sdk-java/src/main/java/com/alvarium/contracts/Annotation.java
Lines 116 to 120 in 7e53f04
No converter is set to properly depict the structure of an annotation, so, in addition to this JSON indirection, Gson will have to use reflection to serialize, which is very unefficient. So I defined a JsonSerializer for the
Annotation
type to avoid this.InstantConverter and ZonedDateTime
I also saw that the InstantConverter's serialization side had a formatting/parsing step to convert
Instant
objects to aZonedDateTime
, so I directly usedZonedDateTime
forAnnotation
timestamps.alvarium-sdk-java/src/main/java/com/alvarium/serializers/InstantConverter.java
Lines 36 to 46 in 7e53f04
I dont think using an abstract type here is relevant as the annotation instantiation process should only be managed by the Alvarium SDK. As the serialization is also defined and peformed by the SDK, this
timestamp
field is fully internally managed, so i think we can sacrifice the O/C principle here (which is useless as the SDK has no reason to use multiple implementations of Instant) to gain some significant performances improvements.ULID Generation
Each
Annotation
object have a unique identifier.The
ULID
class is instanciated in the constructor ofAnnotation
, which will initialize a newSecureRandom
.Initializing a
SecureRandom
is expensive as shown by the profiling results, so I moved theULID
object in a static field of theAnnotation
class, and generate an identifier from it :Hash Data Only Once
The second big bottleneck that can be strongly reduced is that data is hashed by each annotators.
This is caused by a conception issue where the
Annotator
interface is responsible to instantiateAnnotation
objects, which is a quite complex type, as annotations :As the annotators creates new annotations, they are also responsible to hash the data, but the hash is always the same for each annotations, so what i did is to hash the data ahead of time, by the
Sdk
implementation, and then add it as an input to theAnnotator#annotate
method.This is a quickfix as I dont want to make this pull request refactor too much things, because I think a whole PR is needed to reconsider the utility of having as-said
Annotator
s, where what we only want is asatisfaction verdict
(look at all the implementations of theAnnotator
class, 75% of the code is duplicated, only few lines are used to build up an actual verdict).Thanks you.