Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Enhance GSON Converters, hash data only once, cache ULID generator class #137

Open
wants to merge 7 commits into
base: main
Choose a base branch
from

Conversation

Override-6
Copy link

@Override-6 Override-6 commented May 30, 2024

Hello,

This Pull Request implements the three non-breaking changes ideas as discussed in the report mentioned in #136 (section 4.1, page 6).

What has been improved

JSON Optimisation

The main improvement is by optimising GSON serialization.

In the report, I say that 25% of a data annotation process goes to serializing the annotations to JSON.

Avoid GSON reflexion serializer

The profiling results shows that the main part of this overhead is that Gson has to use the java reflection library to automatically serialize the annotations, as no JsonSerializer is properly registered in the GSON context.

Currently, the GSON library seems to be misused :

public class AnnotationConverter implements JsonSerializer<Annotation>, JsonDeserializer<Annotation> {
public JsonElement serialize(Annotation src, Type typeOfSrc, JsonSerializationContext context) {
return JsonParser.parseString(src.toJson());
}
public Annotation deserialize(JsonElement json, Type typeOfT, JsonDeserializationContext context) {
return Annotation.fromJson(json.toString());
}
}

By doing this, the Annotation is converted to a json string (using Annotation#toJson), which is then parsed back to a JSONElement.
The implementation of Annotation#toJson is the following :

public String toJson() {
Gson gson = new GsonBuilder().registerTypeAdapter(Instant.class, new InstantConverter())
.create();
return gson.toJson(this, Annotation.class);
}

No converter is set to properly depict the structure of an annotation, so, in addition to this JSON indirection, Gson will have to use reflection to serialize, which is very unefficient. So I defined a JsonSerializer for the Annotation type to avoid this.

InstantConverter and ZonedDateTime

I also saw that the InstantConverter's serialization side had a formatting/parsing step to convert Instant objects to a ZonedDateTime, so I directly used ZonedDateTime for Annotation timestamps.

public JsonElement serialize(Instant src, Type typeOfSrc, JsonSerializationContext context) {
// a workaround to preserve the zone information
DateTimeFormatter f = DateTimeFormatter.ISO_INSTANT.withZone(ZoneId.systemDefault());
ZonedDateTime zdt = ZonedDateTime.parse(DateTimeFormatter.ISO_INSTANT.format(src), f);
String raw = zdt.toString();
if (raw.indexOf('[') > 0) {
return new JsonPrimitive(zdt.toString().substring(0, raw.indexOf('[')));
}
return new JsonPrimitive(raw);
}

I dont think using an abstract type here is relevant as the annotation instantiation process should only be managed by the Alvarium SDK. As the serialization is also defined and peformed by the SDK, this timestamp field is fully internally managed, so i think we can sacrifice the O/C principle here (which is useless as the SDK has no reason to use multiple implementations of Instant) to gain some significant performances improvements.

ULID Generation

Each Annotation object have a unique identifier.
The ULID class is instanciated in the constructor of Annotation, which will initialize a new SecureRandom.

Initializing a SecureRandom is expensive as shown by the profiling results, so I moved the ULID object in a static field of the Annotation class, and generate an identifier from it :

+ private static final ULID ULID = new ULID();

public Annotation(String key, HashType hash, String host, LayerType layer, AnnotationType kind, String signature,
      Boolean isSatisfied, Instant timestamp) {
-   ULID ulid = new ULID(); 
-   this.id = ulid.nextULID(); 
+   this.id = ULID.nextULID();

Hash Data Only Once

The second big bottleneck that can be strongly reduced is that data is hashed by each annotators.

This is caused by a conception issue where the Annotator interface is responsible to instantiate Annotation objects, which is a quite complex type, as annotations :

  • are associated to a timestamp
  • are bound to a specific data by a hash
  • are bound to a sedevice by the device's hostname
  • are signed

As the annotators creates new annotations, they are also responsible to hash the data, but the hash is always the same for each annotations, so what i did is to hash the data ahead of time, by the Sdk implementation, and then add it as an input to the Annotator#annotate method.

This is a quickfix as I dont want to make this pull request refactor too much things, because I think a whole PR is needed to reconsider the utility of having as-said Annotators, where what we only want is a satisfaction verdict (look at all the implementations of the Annotator class, 75% of the code is duplicated, only few lines are used to build up an actual verdict).

Thanks you.

@Override-6
Copy link
Author

Override-6 commented May 30, 2024

By the way, a lot of the diff is caused by reformat (i'm an addict to ctrl+alt+l in IJ!), what do you think of opening a pull request to setup the maven formatter plugin ?

@Override-6 Override-6 force-pushed the optimisation/nonbreaking branch from 6e5f534 to bc8e871 Compare May 31, 2024 10:11
@tsconn23 tsconn23 requested a review from Ali-Amin June 3, 2024 15:11
@tsconn23 tsconn23 added the enhancement New feature or request label Jun 3, 2024
Copy link
Contributor

@Ali-Amin Ali-Amin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @Override-6, thanks for the suggestions. I have split the review into two parts, taking on the high-level approach you've suggested which I'll discuss here, and some code-specific comments which I'll write down below.

Hashing in the SDK top-level calls

While I think that the hashing of a piece of data should be the role of the Annotator since it is a field on the Annotation being created, I can't deny that the performance gained from this change is significant enough for us to overlook this design choice. I think we should give this a greenlight.

Additionally, I would have hoped that we can replace the data parameter in the annotator execute method to be the hash parameter, but as you already probably know the PkiAnnotator relies on the data, and is also the only annotator that does so. I think we should keep this in mind as it has been discussed that we want to remove the reliance of that annotator on the incoming data schema... just pinning a thought here.

Base64 Encoding Annotations in Publish Wrappers

I believe it's been discussed that breaking changes should not be entertained at this time. Other Alvarium services expect that the publish wrapper will contain the annotation list as a Base64-encoded value. This change should probably be reverted.

ULID Generation

I think this is on the right track, I am not a fan of static variables though. Do you think the same idea you have here can be approached differently ? I was thinking something along the lines of a ULID provider class, similar to how hashing and signing works. Interested to hear your ideas here.

Annotation Serialization

I think this is a straight up improvement on what was already there, just some implementation comments below.

pom.xml Outdated Show resolved Hide resolved
Comment on lines -17 to +36
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;

import com.alvarium.annotators.Annotator;
import com.alvarium.annotators.AnnotatorConfig;
import com.alvarium.annotators.AnnotatorException;
import com.alvarium.annotators.AnnotatorFactory;
import com.alvarium.contracts.Annotation;
import com.alvarium.contracts.AnnotationList;
import com.alvarium.contracts.AnnotationType;
import com.alvarium.hash.HashProviderFactory;
import com.alvarium.hash.HashTypeException;
import com.alvarium.streams.StreamException;
import com.alvarium.streams.StreamProvider;
import com.alvarium.streams.StreamProviderFactory;
import com.alvarium.utils.ImmutablePropertyBag;
import com.alvarium.utils.PropertyBag;

import org.apache.logging.log4j.Logger;

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;

Copy link
Contributor

@Ali-Amin Ali-Amin Jun 6, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please refer to our style guide which contains a section for imports and the correct ordering and grouping. It's a combination of industry styling guides used by Google and Twitter.

I understand these changes are probably due to a formatter, but what maintainers here usually do is create configuration on the editor of choice prior to making changes that reflect what is in that style guide.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No problem, why not trying to use a maven plugin to help apply the code style policy ?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't see why not, we can look into it.

src/main/java/com/alvarium/DefaultSdk.java Outdated Show resolved Hide resolved
src/main/java/com/alvarium/DefaultSdk.java Outdated Show resolved Hide resolved
src/main/java/com/alvarium/annotators/TlsAnnotator.java Outdated Show resolved Hide resolved
public class AlvariumPersistence {


public static final Gson GSON = new GsonBuilder()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From what you've seen in your tests, does just instantiating this object each time a serialization is made take a lot away from performance ? I can understand the reasoning behind making this static since it's not really a changing value, but if it doesn't affect performance I don't see why not instantiate a new AlvariumPersistence each time we need to serialize something.

Copy link
Author

@Override-6 Override-6 Jun 7, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It doesnt that much, I could create a getGson() but a Gson object is not something we gonna mutate, so why not place it in a static place and access it.

I made the GSON instance private, and now expose three static functions from the classs, so i think placing it in a static field is even more adapted now :

public class PeristenceFunctions {

  private static final Gson GSON = ...;

  public static String serializeWrapper(PublishWrapper wrapper) {
    return GSON.toJson(wrapper);
  }

  public static String serializeAnnotation(Annotation annotation) {
    return GSON.toJson(annotation);
  }

  public static Annotation deserializeAnnotation(String jsonAnnotation) {
    return GSON.fromJson(jsonAnnotation, Annotation.class);
  }

}

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well the reason behind my suggestion was that we could just do something like this:

public class JSONSerializer<T> {

  private final Gson gson;
  
  public JSONSerializer() {
      gson = new GsonBuilder()
        .registerTypeAdapter(PublishWrapper.class, new PublishWrapperConverter())
        .registerTypeAdapter(ZonedDateTime.class, new ZonedDateTimeConverter())
        .registerTypeAdapter(Annotation.class, new AnnotationConverter())
        .disableHtmlEscaping()
        .create();
  }

  public String toJson(T serializable) {
    return this.gson.toJson(serializable);
  }


  public T fromJson(String json, java.lang.Class<T> classOfT) {  // maybe there is a way to skip the second parameter, I haven't dug deep enough to check
    return this.gson.fromJson(json, classOfT);
  }

}

and to use it we can just do new JSONSerializer<Annotation>().toJson(annotation) or new JSONSerializer<Annotation>().fromJson(json, Annotation.class). No static instances of anything, and I assume no detriment to performance.

Thoughts ?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see two issues here :

  1. The Gson is configured to be only used by Alvarium (due to the ZonedDateTime adapter that is specific to our use.), i don't see why we want to use a generic method here, as we will only need to support persistance operations for Annotation and PublishWrapper.
    I think that constraining the JSONSerializer with more typed method (serializeWrapper, serializeAnnotation etc) helps to encapsulate the Gson, and so we ensure that we only use the serializer for what it is intended for (as we saw that using it for unknown structures leads to serious performances degradation).

  2. new JSONSerializer().toJson() is semantically the same as JSONSerializer.toJson(), as there is no possible way to mutate the wrapped Gson instance.
    A Gson instance is thread safe for both serialization and deserialization.
    I really dont see any advantage (in both semantic and runtime) in not using a static constant

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That does make sense. I like the idea of enforcing the use of the serializer to only specific types in order to avoid performance issues, so I am sold on that part. But you do realize this will require a fair bit of coupling between the serializers package and any serializable class... Do you think it is possible to move the serialization logic as a member of the respective class ? i.e., something on the lines of String json = publishWrapper.toJson() while keeping all performance improvements ?

Copy link
Author

@Override-6 Override-6 Jun 10, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry i don't understand, do you want to move the functions of the PersistanceFunctions.serializeX to X.toJson() ? If yes then the methods already exists in PublishWrapper and Annotation.

Or do you want to have a GSON instance in PublishWrapper and Annotation and then those gson are specialized to the class they are associated ? If yes then this would be like it is done currently, and this will force you to ensure that the changes you make are set in all Gsons (if you add an adpater then you'll need to not forget to place it in all Gson) this looks like code duplication i think 🤔

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Or do you want to have a GSON instance in PublishWrapper and Annotation and then those gson are specialized to the they are associated ? If yes then would be like it is done currently, and this will force you to ensure that the changes you make are set in all Gsons (if you add an adpater then you'll need to not forget to place it in all Gson) this looks like code duplication i think 🤔

Yes, this is what I meant. I see the implication of code duplication here, and yes it would be a worse approach. I was checking to see if you have any ideas that can improve on the coupling across packages without any downsides, as I have none, but it seems this really is the best approach we have on hand at the moment.

My only gripe at the moment is with the class name still 😄 "Persistence" is confusing in a software development space, which is usually used for long-term persistence such as DBs and storage, and when someone reads PersistenceFunctions.serializeWrapper(wrapper), they might think the output is actually being stored somewhere, somehow. Just a suggestion, how about Serialization.toJson(wrapper) which tells me that only serialization is happening as well as using which format. This makes extensibility to other formats easier as well.

public class Serialization {


  private static final Gson GSON = new GsonBuilder()
      .registerTypeAdapter(PublishWrapper.class, new PublishWrapperConverter())
      .registerTypeAdapter(ZonedDateTime.class, new ZonedDateTimeConverter())
      .registerTypeAdapter(Annotation.class, new AnnotationConverter())
      .disableHtmlEscaping()
      .create();

  public static String toJson(PublishWrapper wrapper) {
    return GSON.toJson(wrapper);
  }

  public static String toJson(Annotation annotation) {
    return GSON.toJson(annotation);
  }

  public static Annotation annotationFromJson(String jsonAnnotation) {
    return GSON.fromJson(jsonAnnotation, Annotation.class);
  }


}

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah sure no problem !

@Override-6
Copy link
Author

Override-6 commented Jun 7, 2024

Hey @Ali-Amin, Thanks for the review.

Hashing in the SDK top-level calls

For me the Annotator interface is a bad conception choice, as it forces its implementation to instantiate a whole Annotation object, which contains many stuff such as signing, hashing, tag, timestamp, hostname etc.

All those things makes the implementation bother with additional operations it shouldnt, and this is why you had to make the hashing process for each annotation, or to inject the layer, the hash, and signature to each annotators.
This forces a lot of code duplication, where only few lines are relevant in the implementation.
In this branch, I wanted to apply all the breaking changes i discussed, with some refactor ideas. One of the refactoring ideas is to replace the Annotator interface with an EnvironmentChecker, that returns true or false if the check it is responsible for is satisfied or not.
This makes the checker only responsible for its check, and improves maintainability and also Developer Experience as the user that wants to add its own check does not have to understand those processes.

I think the responsibility to create Annotation needs to be taken up (inside the Sdk implementation).
This will add more global control over the data being manipulated, and will reduce the work needed when you want to update annotations, refer to PR #135 where the OP had to add its tag overriding system in each annotators, where the PR's diff would only be inside the DefaultSdk class.

ULID Generation

Why not simply using UUID.randomUUID().toString() instead ?, the annotations are already marked by a timestamp

Format

I think we could open a PR to set up a code formatter and configure it to apply the code style convention?

@Override-6 Override-6 force-pushed the optimisation/nonbreaking branch from 4f31b02 to 9167729 Compare June 7, 2024 09:09
@Override-6 Override-6 force-pushed the optimisation/nonbreaking branch from 9167729 to b5374cb Compare June 7, 2024 09:13
@Override-6
Copy link
Author

Override-6 commented Jun 7, 2024

Additionally, I would have hoped that we can replace the data parameter in the annotator execute method to be the hash parameter, but as you already probably know the PkiAnnotator relies on the data, and is also the only annotator that does so. I think we should keep this in mind as it has been discussed that we want to remove the reliance of that annotator on the incoming data schema... just pinning a thought here.

Why not removing the data from the Annotator.execute method, and let the user add the data it wants to check in the PropertyBag ?

In this scala sdk poc, I wanted to use a more typesafe way to inject things at "annotation evaluation time", maybe we could use it here, which could allow the PkiAnnotator to stay in the sdk!

@Override-6 Override-6 requested a review from Ali-Amin June 7, 2024 09:28
@Ali-Amin
Copy link
Contributor

Hey @Ali-Amin, Thanks for the review.

Hashing in the SDK top-level calls

For me the Annotator interface is a bad conception choice, as it forces its implementation to instantiate a whole Annotation object, which contains many stuff such as signing, hashing, tag, timestamp, hostname etc.

All those things makes the implementation bother with additional operations it shouldnt, and this is why you had to make the hashing process for each annotation, or to inject the layer, the hash, and signature to each annotators. This forces a lot of code duplication, where only few lines are relevant in the implementation. In this branch, I wanted to apply all the breaking changes i discussed, with some refactor ideas. One of the refactoring ideas is to replace the Annotator interface with an EnvironmentChecker, that returns true or false if the check it is responsible for is satisfied or not. This makes the checker only responsible for its check, and improves maintainability and also Developer Experience as the user that wants to add its own check does not have to understand those processes.

I think the responsibility to create Annotation needs to be taken up (inside the Sdk implementation). This will add more global control over the data being manipulated, and will reduce the work needed when you want to update annotations, refer to PR #135 where the OP had to add its tag overriding system in each annotators, where the PR's diff would only be inside the DefaultSdk class.

ULID Generation

Why not simply using UUID.randomUUID().toString() instead ?, the annotations are already marked by a timestamp

Format

I think we could open a PR to set up a code formatter and configure it to apply the code style convention?

  1. We have previously discussed a similar thing to what you are referring to, an annotation 'Envelope' that handles these types of redundant processes for all annotations regardless of type. Albeit, we had discussed this for purposes other than performance optimizations, but it's good to see that this is aligned to what we are thinking. Again, I don't believe we will be entertaining a breaking change like that at this moment, but it would be good to see how your approach fits within our goals.
  2. I understand what you mean regarding ULID, but we are conforming with the go-sdk, the rust-sdk, and the python-sdk. If you see a point of improvement here, we can take that discussion to the go-sdk repo, as we initially drive changes like this from the go-sdk first then to the ports. Or we can bring it up in our next TSC call, whatever you see fit.
  3. I think a formatting setup would be perfect, as long as it's IDE independent.

@Ali-Amin
Copy link
Contributor

Additionally, I would have hoped that we can replace the data parameter in the annotator execute method to be the hash parameter, but as you already probably know the PkiAnnotator relies on the data, and is also the only annotator that does so. I think we should keep this in mind as it has been discussed that we want to remove the reliance of that annotator on the incoming data schema... just pinning a thought here.

Why not removing the data from the Annotator.execute method, and let the user add the data it wants to check in the PropertyBag ?

In this scala sdk poc, I wanted to use a more typesafe way to inject things at "annotation evaluation time", maybe we could use it here, which could allow the PkiAnnotator to stay in the sdk!

I think that might be a good approach yes, because the data is needed for the annotation's 'satisfaction' logic, and we usually pass info related to that 'satisfaction' logic in the property bag for other annotators. So it makes sense to me to do the same here.

@Override-6
Copy link
Author

I understand for ULID, i forgot that changing the ID generation algorithm would also be a breaking change !

For simplicity's sake, i'm personally more in favor of keeping the ULID in a static field (I indeed could create a ULIDGenerator utility class). after that, why not adding an IDProvider in the Sdk's configuration in the future.

I think i'll open a new PR to setup a formatter

I think that might be a good approach yes, because the data is needed for the annotation's 'satisfaction' logic, and we usually pass info related to that 'satisfaction' logic in the property bag for other annotators. So it makes sense to me to do the same here.

Do you want me to implement this idea here or do i create another PR for this (once this one will be merged) ?

I am interested by what you mean by Envelope, do you have any thing i can read to better understand the idea ?

Thanks you !

@Ali-Amin
Copy link
Contributor

Ali-Amin commented Jun 10, 2024

I understand for ULID, i forgot that changing the ID generation algorithm would also be a breaking change !

For simplicity's sake, i'm personally more in favor of keeping the ULID in a static field (I indeed could create a ULIDGenerator utility class). after that, why not adding an IDProvider in the Sdk's configuration in the future.

I think the advantages of an ID generator outweigh those of a static instantiation of ULID, having a default value for the ID generator config as well might be a good idea to prevent headache for other devs.

I think i'll open a new PR to setup a formatter

Awesome, looking forward to it.

I think that might be a good approach yes, because the data is needed for the annotation's 'satisfaction' logic, and we usually pass info related to that 'satisfaction' logic in the property bag for other annotators. So it makes sense to me to do the same here.

Do you want me to implement this idea here or do i create another PR for this (once this one will be merged) ?

I think this PR is large enough, lets keep any more changes and suggestions into their own single units to keep reviews simple and fast.

I am interested by what you mean by Envelope, do you have any thing i can read to better understand the idea ?

I do not have anything on hand unfortunately, I will try to fetch the TSC call recording link we had where we discussed plans for it.

Copy link
Contributor

@Ali-Amin Ali-Amin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Most of everything else looks good to me. As soon as this is fixed, I'll plug these changes into a working Jenkins pipeline and ensure scores are being calculated without error.

src/main/java/com/alvarium/Sdk.java Show resolved Hide resolved
@Ali-Amin
Copy link
Contributor

I am interested by what you mean by Envelope, do you have any thing i can read to better understand the idea ?

This recording https://wiki.lfedge.org/download/attachments/68584065/Alvarium-TSC-11-Mar-2024.mp4?api=v2 should have the relevant info, the discussion starts around 9:50.

@Override-6 Override-6 requested a review from Ali-Amin June 11, 2024 14:23
@Override-6 Override-6 force-pushed the optimisation/nonbreaking branch from 83c50cd to 2859b54 Compare June 12, 2024 09:08
@Override-6 Override-6 force-pushed the optimisation/nonbreaking branch from 2859b54 to 01cd65f Compare June 12, 2024 09:16
@tsconn23
Copy link
Contributor

@Ali-Amin @Override-6 Thanks for the collaboration on this. Please have some patience w/r/t the timing of getting this work approved and merged. We have an effort in front of it that is proving difficult and I'd like to get that resolved before considering these changes. There is also some pending vacation time coming up which will add a delay. Just FYI that we're still tracking this PR.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants