Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implements kj::HashSet support for jsg. #3550

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

dom96
Copy link
Collaborator

@dom96 dom96 commented Feb 14, 2025

Test Plan

bazel run @workerd//src/workerd/jsg:value-test

Copy link
Member

@kentonv kentonv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's the intended use case here?

IMO HashSet is not the right C++ representation to correspond to a JS Set. The problem is, in order to convert between JS Set and HashSet, you necessarily have to enumerate the whole set, which is O(n), which probably defeats the purpose of using a Set vs. just using an array or something.

Probably all use cases fall into one of two categories:

  • The C++ consumer is going to iterate the whole set. In this case, building the hash index is a waste of time. It would be better to simply deliver an array.
  • The C++ consumer only wants to look up certain values in the set. In this case, what is really wanted is some sort of a wrapper that does not iterate the whole set upfront, but instead looks up specific values on-demand.

Or conversely for the C++ -> JS direction:

  • The C++ producer is going to construct a one-off set to pass to JS. There's again no reason to construct a hash index, the C++ representation should be an array or vector.
  • The C++ producer is passing some long-lived set to JS. In this case, JS should really get a view into the set that doesn't convert the contents until requested, otherwise it's wasteful to re-convert the whole set every time it is requested.

src/workerd/jsg/value.h Outdated Show resolved Hide resolved
@dom96
Copy link
Collaborator Author

dom96 commented Feb 14, 2025

The use case is #3538 (comment) where a C++ function is producing a HashSet and the JS side also wants a Set. My initial solution was to convert the set to an array and return that from the C++ side, then convert to a Set in JS. Dan suggested that jsg should support this so I implemented it.

@kentonv
Copy link
Member

kentonv commented Feb 14, 2025

Hmm, I see, this does appear to be an unusual use case where you:

  • Built a one-off kj::HashSet on the C++ side.
  • But you actually did need the hash index, because the code to build it is walking a DAG and needs to detect duplicates.
  • And then you actually want to pass this HashSet off to JS as a Set.

So converting a HashSet to a JS Set is actually what you want here, and there's no wasted index-building.

In the future we might want to add an alternative representation like:

template <typename T>
struct JsSet: public kj::Vector<T> {};

Kind of like how jsg::Optional<T> is an alias for kj::Maybe<T> that instructs JSG on how to represent it to JS.

But I guess we could also support HashSet<T>, and you use whichever one is best for the use case. I just worry about people using HashSet "because it makes sense" without thinking through problems with wasted indexing or differing equality functions...

@kentonv
Copy link
Member

kentonv commented Feb 14, 2025

I think we still do need an answer for the duplicate question when converting a Set to a HashSet. Possible answers are:

  • First instance wins.
  • Last instance wins.
  • Detect and throw a JSG error instead of "internal error".
  • Don't implement JS Set -> kj::HashSet for now. Your use case only needs the opposite direction anyway, it's not necessary to provide both.

@jasnell
Copy link
Member

jasnell commented Feb 14, 2025

... The use case is #3538 (comment) where a C++ function is producing a HashSet and the JS side also wants a Set.

It does seem rather unfortunate that what this conversion pattern ends up with is:

  1. Iterate through a collection to construct the kj::HashSet (not entirely low-cost due to the maintenance of the hash index)...
  2. Create the JS Set
  3. Iterate through the kj::HashSet to add items to the JS Set which has its own costs for its own internal hash index

It would seem to be more efficient to skip the creation of the kj::HashSet entirely if possible and just have the C++ code just directly create the v8::Set (wrapped as a jsg::JsSet) and return that directly.

Is the kj::HashSet itself being used for other things?

@dom96
Copy link
Collaborator Author

dom96 commented Feb 14, 2025

Is the kj::HashSet itself being used for other things?

Yeah, the function is also called from C++ in edgeworker to pre-load the correct Python packages.

@dom96 dom96 force-pushed the dominik/jsg-hashset-support branch 2 times, most recently from 5cb3816 to d91b4d8 Compare February 14, 2025 17:17
@dom96
Copy link
Collaborator Author

dom96 commented Feb 14, 2025

Okay, I fixed the duplicate issue and added a test for it.

src/workerd/jsg/value.h Outdated Show resolved Hide resolved
src/workerd/jsg/value.h Outdated Show resolved Hide resolved
jasnell
jasnell previously approved these changes Feb 14, 2025
Copy link
Member

@jasnell jasnell left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not entirely sold on this just yet. If the hashsets are small and the objects in the set are simple enough then it's likely to be ok but there's likely a number of performance issues associated with model.

Consider, for instance, if I have a kj::HashSet<kj::String>. Let's say those strings end up being fairly non-trivial in size. To get that hash set out to JavaScript would mean multiple allocations and copies for each string in addition to iterating over the entire set.

kj::HashSet<kj::String> set;
set.add(kj::str("here is a string"));  // allocation
set.add(kj::str("here is another string"));  // allocation

Converting this to a JS set means we perform an additional allocation in the v8 heap and do a utf8 write to copy the strings in addition to v8 having to calculate the hashes of those for it's own internal hash index. That just feels super wasteful at scale when the C++ code could just start with the JS Set in the first place.

@jasnell jasnell dismissed their stale review February 14, 2025 20:02

Shouldn't have hit approve... meant to hit Request changes

@kentonv
Copy link
Member

kentonv commented Feb 14, 2025

@jasnell you are raising the same objections I did initially, but then Dominik pointed to his use case and indeed in that use case this conversion makes sense. He hash a HashSet<String> which he legitimately had to build as a HashSet (he couldn't have built it as a Vector), and he wants it to become a JS Set.

It's debatable whether that is a common enough case to build into JSG, but it does check out in this case.

@jasnell
Copy link
Member

jasnell commented Feb 14, 2025

Ok well, I won't block it but I'm still not convinced.

auto out = v8::Set::New(isolate);
for (const auto& item: set) {
out = check(out->Add(context,
static_cast<TypeWrapper*>(this)->wrap(context, creator, item).template As<v8::Value>()));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this actually kj::mv(item)?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hm, do you mean that I should pass an owned U to wrap? I don't think I can do that from a HashSet, certainly in the case of kj::String it doesn't work. Isn't passing a ref fine here?

@dom96 dom96 force-pushed the dominik/jsg-hashset-support branch from d91b4d8 to 5d12f05 Compare February 17, 2025 12:09
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants