-
Notifications
You must be signed in to change notification settings - Fork 13
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Initial commit of substrait-cpp #1
Conversation
@jacques-n, please help to review this PR, thanks. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've put in a few specific comments but I think there are larger overall questions. I don't know if these all need to be answered now but it's probably worth discussion.
- Is there a particular style guide you would like to follow? I'm used to Arrow-C++ which follows the Google C++ style guide but this code is quite divergent from that so do you have an alternative style guide you can point new contributors to?
- Is your goal to create a library used by consumers? by producers? or both?
- What operating systems and architectures are you planning on supporting directly? What sorts of targets do you hope will eventually be supported by the community as a whole?
- What kind of philosophy do you hope to take for exception handling. I see in some places that you are throwing exceptions. Some libraries will go out of their way to try and avoid exposing exceptions for easier compatibility. Do you have a strong opinion on this?
- What sort of philosophy do you plan on taking for encapsulating the standard library? Right now you are including in
Type.h
for example which puts a rather large standard include in your public API. Do you want to eventually hide these things in implementation files as much as possible?
CMakeLists.txt
Outdated
set(FILESYSTEM "stdc++fs") | ||
|
||
# Ensure we have gcc at least 8+. | ||
if(CMAKE_CXX_COMPILER_VERSION LESS 8.0) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Any particular reason to target this version? I think this is probably a good cutoff (allows for C++17) but I was just curious.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Removed
CMakeLists.txt
Outdated
# Ensure we have gcc at least 8+. | ||
if(CMAKE_CXX_COMPILER_VERSION LESS 8.0) | ||
message( | ||
FATAL_ERROR "Substrait-CPP requires gcc > 8. Found ${CMAKE_CXX_COMPILER_VERSION}") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
FATAL_ERROR "Substrait-CPP requires gcc > 8. Found ${CMAKE_CXX_COMPILER_VERSION}") | |
FATAL_ERROR "Substrait-CPP requires gcc >= 8. Found ${CMAKE_CXX_COMPILER_VERSION}") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
removed
CMakeLists.txt
Outdated
FATAL_ERROR "Substrait-CPP requires gcc > 8. Found ${CMAKE_CXX_COMPILER_VERSION}") | ||
endif() | ||
else() | ||
set(FILESYSTEM "") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What other OS' do you plan on supporting?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
removed
src/CMakeLists.txt
Outdated
set(proto_directory ${CMAKE_SOURCE_DIR}/third_party/substrait/proto) | ||
set(substrait_proto_directory ${proto_directory}/substrait) | ||
set(PROTO_OUTPUT_DIR "${CMAKE_CURRENT_SOURCE_DIR}/proto/") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Minor nit: the casing here seems inconsistent proto_directory
vs. PROTO_OUTPUT_DIR
, can we be more consistent?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
updated
src/Extension.cpp
Outdated
|
||
using namespace io::substrait; | ||
|
||
static bool decodeFunctionVariant(const Node &node, FunctionVariant &function) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Minor nit: I think anonymous namespaces are generally preferred over static functions.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
moved to anonymous namespaces
src/Extension.cpp
Outdated
|
||
#include "Extension.h" | ||
|
||
namespace YAML { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
namespace YAML { | |
namespace YAML { |
This is a bit peculiar to me. Why namespace YAML
? It's not referenced or used anywhere else.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe this is used here to establish unqualified access to the members of the YAML
namespace, which are declared by yaml-cpp
. This is very bad for readability, could we have one consistent namespace for this project (namespace substrait_cpp
, maybe) then refer to members of other namespaces by their qualified names (YAML::Node
, for example)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
updated
src/FunctionMapping.h
Outdated
#include "memory" | ||
#include "unordered_map" | ||
#include "vector" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm used to seeing the <unordered_map>
style used for standard includes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
updated
find_package(Protobuf REQUIRED) | ||
include_directories(${PROTOBUF_INCLUDE_DIRS}) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not a cmake expert but why do we do find_package
for protobuf
but not for yaml-cpp
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
because we have to use 'protoc' command to compile substrait protocol files before we compile this project.
src/Function.h
Outdated
virtual const bool isWildcardType() const { return false; }; | ||
|
||
virtual const bool isValueArgument() const { return false; } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure there is much value in returning a const bool
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
updated
src/Function.h
Outdated
struct EnumArgument : public FunctionArgument { | ||
bool required; | ||
bool const isRequired() const override { return required; } | ||
|
||
const std::string toTypeString() const override { | ||
return required ? "req" : "opt"; | ||
} | ||
}; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is probably going to be out of date once substrait-io/substrait#342 merges. Though I'm not sure there is much that can be done about that. Have you taken a look at substrait-io/substrait#342 ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I noticed this PR, I will rebase the substrait submodule once it merged.
CMakeLists.txt
Outdated
find_package(Protobuf REQUIRED) | ||
include_directories(${PROTOBUF_INCLUDE_DIRS}) | ||
|
||
# GCC needs to link a library to enable std::filesystem. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is std::filesystem
needed for a substrait library?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
removed
CMakeLists.txt
Outdated
if("${CMAKE_CXX_COMPILER_ID}" MATCHES "GNU") | ||
set(FILESYSTEM "stdc++fs") | ||
|
||
# Ensure we have gcc at least 8+. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
removed
src/Type.h
Outdated
#pragma once | ||
|
||
#include "proto/substrait/algebra.pb.h" | ||
#include "proto/substrait/type.pb.h" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's not a good idea to expose protobuf headers in this library's headers, IMHO.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do you have any suggestions?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The usual solution is to not expose any protobuf-generated types in your own headers. So define your own C++ classes and enums for any types that might have an equivalent in the protobuf definitions...
This is the solution we use in Parquet (with Thrift), Arrow (with flatbuffers), Arrow Flight (with protobuf).
(I'll add that even with that, protobuf still has issues in non-trivial scenarios: substrait-io/substrait#249 ; but it will have less of them)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds good, I will refactor to use our own type enums instead of protobuf based type enums.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
updated
src/Extension.cpp
Outdated
|
||
#include "Extension.h" | ||
|
||
namespace YAML { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe this is used here to establish unqualified access to the members of the YAML
namespace, which are declared by yaml-cpp
. This is very bad for readability, could we have one consistent namespace for this project (namespace substrait_cpp
, maybe) then refer to members of other namespaces by their qualified names (YAML::Node
, for example)?
src/Extension.cpp
Outdated
|
||
namespace YAML { | ||
|
||
using namespace io::substrait; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
using namespace io::substrait; | |
using namespace io::substrait; |
Similar to my other comment https://github.com/substrait-io/substrait-cpp/pull/1/files#r1004739159 I recommend against using namespace
. This is broadly considered bad practice by several style guides https://isocpp.org/wiki/faq/coding-standards#using-namespace-std
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
updated
src/Type.h
Outdated
class UsedDefinedType : public TypeBase<TypeKind ::kUserDefined> { | ||
public: | ||
UsedDefinedType(const std::string &value) : value_(value) {} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
class UsedDefinedType : public TypeBase<TypeKind ::kUserDefined> { | |
public: | |
UsedDefinedType(const std::string &value) : value_(value) {} | |
class UserDefinedType : public TypeBase<TypeKind ::kUserDefined> { | |
public: | |
UserDefinedType(const std::string &value) : value_(value) {} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
updated
src/Type.h
Outdated
#define SCALAR_ACCESSOR(KIND) \ | ||
std::shared_ptr<const ScalarType<TypeKind::KIND>> KIND() | ||
|
||
SCALAR_ACCESSOR(kBool); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This results in functions declarations like:
std::shared_ptr<const ScalarType<TypeKind::kBool>> kBool();
which will surprise users who expect the kThing
naming convention to refer only to enum members. In this case I think we're not saving a lot of boilerplate by using a macro and it'd be better to write the declarations without using one:
#define SCALAR_ACCESSOR(KIND) \ | |
std::shared_ptr<const ScalarType<TypeKind::KIND>> KIND() | |
SCALAR_ACCESSOR(kBool); | |
std::shared_ptr<const ScalarType<TypeKind::kBool>> bool(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
updated
src/FunctionMapping.h
Outdated
virtual const FunctionMap scalaMapping() const { | ||
static const FunctionMap scalaFunctionMap{}; | ||
return scalaFunctionMap; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think your intent here was to return a constant reference to a static instance, but without the &
you'll be copying it instead. That's not a problem for the default mapping, but as mappings get larger it'll add significant overhead which is unnecessary unless you intend for FunctionMapping
s to be mutable after construction (in which case they'd need to construct a new FunctionMap
on every call).
virtual const FunctionMap scalaMapping() const { | |
static const FunctionMap scalaFunctionMap{}; | |
return scalaFunctionMap; | |
virtual const FunctionMap& scalarMapping() const { | |
static const FunctionMap scalarFunctionMap{}; | |
return scalarFunctionMap; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
updated
src/Function.h
Outdated
std::optional<int> max; | ||
}; | ||
|
||
struct FunctionVariant { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IIUC, this corresponds to an element of simple_extensions_schema.yaml::$defs.scalarFunction.impls
. To reduce confusion, perhaps this could be renamed to FunctionImpl
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I saw substrait-java also use FunctionVariant
src/Extension.cpp
Outdated
const std::string absolute_path = __FILE__; | ||
auto const pos = absolute_path.find_last_of('/'); | ||
return absolute_path.substr(0, pos) + "/../third_party/substrait/extensions/"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think relying on a macro in this fashion is extremely fragile. The standard doesn't give guarantees on this macro's precise value https://eel.is/c++draft/cpp#predefined-1.4 (could be relative to the working directory of the compiler, could fail to include any directory information at all, on msvc the directory separators are \
, ...).
Moreover, this approach assumes that the path represented by getSubstraitExtensionAbsolutePath()
will always refer to a valid directory; if the library were built then the repository deleted, or if the library were copied to a different machine without the rest of the files in the repo then that will break this function.
As an alternative, what if we got this path from an environment variable? Or required it to be configured by users via getSubstraitExtensionAbsolutePath(std::string)
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
updated
src/Extension.h
Outdated
class Extension { | ||
public: | ||
/// Deserialize default substrait extension. | ||
static std::shared_ptr<Extension> load(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yaml-cpp will throw an exception in case the file cannot be found or parsing fails. Is it your intention to defer such an exception? If so, I'd say it's important to document which functions can throw and under which circumstances. In light of the fact that many c++ projects avoid usage of exceptions, you might prefer to catch yaml-cpp's exceptions and report errors through a different channel (for example, something analagous to std::expected
).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Got it ,do you have any suggestion for this exception handling?
|
uploaded clang-format in this PR.
The long term would be both just like substrait-java does, and the next step is to create relation & expression CPP classes to represent PROTOBUF based types.
I hope this tiny project should support mainstream OS like ubuntu, macos etc.
Please have a look at exception handling here
Sure, I have move public API into include folder |
Init commit of substrait-cpp
This PR is an Initial commit of substrait-cpp which includes following features: