Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add Support for IP Address Function #9501

Merged
merged 21 commits into from
Nov 8, 2022

Conversation

SabrinaZhaozyf
Copy link
Contributor

@SabrinaZhaozyf SabrinaZhaozyf commented Sep 29, 2022

Label = feature

This PR adds a scalar function isSubnetOff(IP_PREFIX, IP_ADDRESS) which checks if IP_ADDRESS is in the subnet of IP_PREFIX (this function handles both IPv4 and IPv6).

Both IP_PREFIX and IP_ADDRESS are STRING type.

Use Cases
SELECT isSubnetOf('1.2.3.128/26', '1.2.3.129') FROM FOO --> returns true
SELECT isSubnetOf('64:fa9b::17/64', '64:ffff::17') FROM FOO --> returns false

@siddharthteotia
Copy link
Contributor

Context - One of our observability users wanted subnet-contains and other similar functionality for IP addresses stored as STRINGs. To be used in SQL WHERE clause mostly

@siddharthteotia siddharthteotia changed the title [WIP] Add Support for IP Address Function Add Support for IP Address Function Oct 3, 2022
@SabrinaZhaozyf SabrinaZhaozyf marked this pull request as ready for review October 3, 2022 23:09
@siddharthteotia
Copy link
Contributor

siddharthteotia commented Oct 4, 2022

Can we add some e2e query execution tests where we use this in query execution (WHERE clause) similar to how you added other transform functions in the past ?

Let's add some tests with meaningful IP addresses and then test out the functionality for both IPv4 and IPv6.

return true;
}

int shift;
Copy link
Contributor

@siddharthteotia siddharthteotia Oct 4, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a standard / known way of writing this logic ? We need to come up with comprehensive tests

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is no Java native library functions that handle both IPv4 and IPv6 prefix. I added comments on the alg details.

Presto uses a similar approach to compute the min and max for a given IP prefix.
max: https://github.com/prestodb/presto/blob/master/presto-main/src/main/java/com/facebook/presto/operator/scalar/IpPrefixFunctions.java#L89
min: https://github.com/prestodb/presto/blob/8a488efc63c94273c7961c87b6ab9adf141c83bb/presto-main/src/main/java/com/facebook/presto/type/IpPrefixOperators.java#L152

@codecov-commenter
Copy link

codecov-commenter commented Oct 4, 2022

Codecov Report

Merging #9501 (a4093cb) into master (9f5aa89) will increase coverage by 39.29%.
The diff coverage is 42.85%.

@@              Coverage Diff              @@
##             master    #9501       +/-   ##
=============================================
+ Coverage     24.62%   63.91%   +39.29%     
- Complexity       53     4910     +4857     
=============================================
  Files          1935     1893       -42     
  Lines        103815   101871     -1944     
  Branches      15758    15529      -229     
=============================================
+ Hits          25560    65107    +39547     
+ Misses        75635    31997    -43638     
- Partials       2620     4767     +2147     
Flag Coverage Δ
integration2 ?
unittests1 67.40% <42.85%> (?)
unittests2 15.76% <0.00%> (?)

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
...not/common/function/scalar/IpAddressFunctions.java 42.85% <42.85%> (ø)
...va/org/apache/pinot/core/routing/RoutingTable.java 0.00% <0.00%> (-100.00%) ⬇️
...va/org/apache/pinot/common/config/NettyConfig.java 0.00% <0.00%> (-100.00%) ⬇️
...a/org/apache/pinot/common/metrics/MinionMeter.java 0.00% <0.00%> (-100.00%) ⬇️
...g/apache/pinot/common/metrics/ControllerMeter.java 0.00% <0.00%> (-100.00%) ⬇️
.../apache/pinot/common/metrics/BrokerQueryPhase.java 0.00% <0.00%> (-100.00%) ⬇️
.../apache/pinot/common/metrics/MinionQueryPhase.java 0.00% <0.00%> (-100.00%) ⬇️
...ache/pinot/server/access/AccessControlFactory.java 0.00% <0.00%> (-100.00%) ⬇️
...he/pinot/common/messages/SegmentReloadMessage.java 0.00% <0.00%> (-100.00%) ⬇️
...he/pinot/common/messages/TableDeletionMessage.java 0.00% <0.00%> (-100.00%) ⬇️
... and 1578 more

📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more

@kkrugler
Copy link
Contributor

kkrugler commented Oct 5, 2022

We've been using https://github.com/seancfoley/IPAddress for exactly this type of address parsing. It's very well tested, and seems robust. Wondering why you wouldn't want to use it for this functionality (and eventually other IP address functions supported by Presto, such as ip_prefix).

@kkrugler
Copy link
Contributor

kkrugler commented Oct 5, 2022

Nit - PR description says the function name is isSubnetOf, but code seems to be using is_subnet_of.

@kkrugler
Copy link
Contributor

kkrugler commented Oct 5, 2022

One more comment - I think Presto handles either VARCHAR or IPADDRESS for the ip address. For one of our use cases, we've got IPv4 stored as longs in the Pinot table, which would be more efficient to query as-is versus using a custom function to convert back to a string. We could extend this support in the future, wondering if there would be a reason we shouldn't plan on being able to do that.

@siddharthteotia
Copy link
Contributor

@walterddr - please take a look as well

@siddharthteotia
Copy link
Contributor

Based on @kkrugler 's suggestion to explore library, I had asked @SabrinaZhaozyf to do some investigation along the following lines.

  • Do we think library will offer comprehensive IP processing functionality besides subnet_contains (which is the current focus of your PR but we want to add more more in future) ?

  • Suggest taking a look at IP related functions supported by Kusto / Presto / Snowflake / Postgres to get an idea of IP processing landscape

  • Do we think our impl can be simplified by using the library and be less error prone ?

  • Probably not a good idea to use the library if it only supports specific functionality.

  • Some systems support IP address as a data type (NETWORK_ADDRESS IIRC ). We don't want to go there (yet) but good to check once if using the library will force us down a particular approach

@SabrinaZhaozyf , you may want to share any of your findings here for reference / discussion.

@Jackie-Jiang @walterddr @xiangfu0 - both approaches are implemented here. Can we get consensus on how we move forward ? One thing is that we will have to overload this for different types. Have you seen a similar requirement ?

@SabrinaZhaozyf
Copy link
Contributor Author

Based on @kkrugler 's suggestion to explore library, I had asked @SabrinaZhaozyf to do some investigation along the following lines.

  • Do we think library will offer comprehensive IP processing functionality besides subnet_contains (which is the current focus of your PR but we want to add more more in future) ?
  • Suggest taking a look at IP related functions supported by Kusto / Presto / Snowflake / Postgres to get an idea of IP processing landscape
  • Do we think our impl can be simplified by using the library and be less error prone ?
  • Probably not a good idea to use the library if it only supports specific functionality.
  • Some systems support IP address as a data type (NETWORK_ADDRESS IIRC ). We don't want to go there (yet) but good to check once if using the library will force us down a particular approach

@SabrinaZhaozyf , you may want to share any of your findings here for reference / discussion.

@Jackie-Jiang @walterddr @xiangfu0 - both approaches are implemented here. Can we get consensus on how we move forward ? One thing is that we will have to overload this for different types. Have you seen a similar requirement ?

I have summarized my research on IP processing functions in other systems and the library in this doc: https://docs.google.com/document/d/1SS04jQCojcvCrrIGJX_aHocUEXXy8rpX0uUhwAren4k/edit?usp=sharing

Please feel free to provide feedback / comments and let's get consensus on how we want to proceed from here.

Copy link
Contributor

@walterddr walterddr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was wondering if we can use a specific network address type for the storage. i can imagine using no-dict columns but storing it as fixed-size data type might be desirable
see: https://www.postgresql.org/docs/current/datatype-net-types.html

Comment on lines 136 to 137
public static boolean isSubnetOfV1(String ipPrefix, String ipAddress)
throws UnknownHostException {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why do we need a v1 ? and why is it throwing exceptions?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In this PR, there are 2 paths of implementations as discussed here.
The v1 is the in-house implementation without calling the library.

@kishoreg
Copy link
Member

I was about to suggest the same as Rong. Given that we are going toward s postgres with joins, let's emulate the functions as well.

https://www.postgresql.org/docs/current/functions-net.html

Copy link
Contributor

@siddharthteotia siddharthteotia left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Discussed with @walterddr. We agree that we can implement proper IP type and optimal storage support.

In this PR, we are just implementing the basic function for subnet contains / isSubnetOf that takes a STRING and returns BOOLEAN output.

With the orthogonal work going on function registry, later we should be able to add function that takes network address type or whatever it evolves to in long term.

Copy link
Contributor

@walterddr walterddr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm but one question. we can also follow up

Comment on lines +73 to +78
IPAddress prefix = getPrefix(ipPrefix);
IPAddress ip = getAddress(ipAddress);
if (ip.isPrefixed()) {
throw new IllegalArgumentException("IP Address " + ipAddress + " should not be prefixed.");
}
return prefix.contains(ip);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if we allow string input then we need to handle this (e.g. malformed string should be handled. throwing here seems a bit risky)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @walterddr , thanks for the comment! Could you please elaborate a bit more?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

actually no worries. we can follow up.

  • upon checking some other systems, very little has IPADDRESS built in
  • PostgreSQL throws when IP varchar cast is a malform string
  • MySQL has a is IP address function to check if a varchar is a IP address format.

so IMO,

  • for now this impl is good.
  • we can provide a safety net like MySQL later: CASE WHEN is_ipaddress(col1) AND is_ipaddress(col2) THEN isSubnetOf(col1, col2) ELSE false END

@siddharthteotia siddharthteotia merged commit e3f2835 into apache:master Nov 8, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

9 participants