Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Initial commit of substrait-cpp #1

Closed
wants to merge 1 commit into from

Conversation

chaojun-zhang
Copy link
Contributor

@chaojun-zhang chaojun-zhang commented Oct 10, 2022

This PR is an Initial commit of substrait-cpp which includes following features:

  • parse substrait extension yaml file as Type.h, Function.h and Extension.h
  • lookup function variant by types

@chaojun-zhang
Copy link
Contributor Author

@jacques-n, please help to review this PR, thanks.

@CLAassistant
Copy link

CLAassistant commented Oct 10, 2022

CLA assistant check
All committers have signed the CLA.

Copy link
Member

@westonpace westonpace left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've put in a few specific comments but I think there are larger overall questions. I don't know if these all need to be answered now but it's probably worth discussion.

  • Is there a particular style guide you would like to follow? I'm used to Arrow-C++ which follows the Google C++ style guide but this code is quite divergent from that so do you have an alternative style guide you can point new contributors to?
  • Is your goal to create a library used by consumers? by producers? or both?
  • What operating systems and architectures are you planning on supporting directly? What sorts of targets do you hope will eventually be supported by the community as a whole?
  • What kind of philosophy do you hope to take for exception handling. I see in some places that you are throwing exceptions. Some libraries will go out of their way to try and avoid exposing exceptions for easier compatibility. Do you have a strong opinion on this?
  • What sort of philosophy do you plan on taking for encapsulating the standard library? Right now you are including in Type.h for example which puts a rather large standard include in your public API. Do you want to eventually hide these things in implementation files as much as possible?

CMakeLists.txt Outdated
set(FILESYSTEM "stdc++fs")

# Ensure we have gcc at least 8+.
if(CMAKE_CXX_COMPILER_VERSION LESS 8.0)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any particular reason to target this version? I think this is probably a good cutoff (allows for C++17) but I was just curious.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed

CMakeLists.txt Outdated
# Ensure we have gcc at least 8+.
if(CMAKE_CXX_COMPILER_VERSION LESS 8.0)
message(
FATAL_ERROR "Substrait-CPP requires gcc > 8. Found ${CMAKE_CXX_COMPILER_VERSION}")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
FATAL_ERROR "Substrait-CPP requires gcc > 8. Found ${CMAKE_CXX_COMPILER_VERSION}")
FATAL_ERROR "Substrait-CPP requires gcc >= 8. Found ${CMAKE_CXX_COMPILER_VERSION}")

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

removed

CMakeLists.txt Outdated
FATAL_ERROR "Substrait-CPP requires gcc > 8. Found ${CMAKE_CXX_COMPILER_VERSION}")
endif()
else()
set(FILESYSTEM "")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What other OS' do you plan on supporting?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

removed

Comment on lines 16 to 18
set(proto_directory ${CMAKE_SOURCE_DIR}/third_party/substrait/proto)
set(substrait_proto_directory ${proto_directory}/substrait)
set(PROTO_OUTPUT_DIR "${CMAKE_CURRENT_SOURCE_DIR}/proto/")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor nit: the casing here seems inconsistent proto_directory vs. PROTO_OUTPUT_DIR, can we be more consistent?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

updated


using namespace io::substrait;

static bool decodeFunctionVariant(const Node &node, FunctionVariant &function) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor nit: I think anonymous namespaces are generally preferred over static functions.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

moved to anonymous namespaces


#include "Extension.h"

namespace YAML {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
namespace YAML {
namespace YAML {

This is a bit peculiar to me. Why namespace YAML? It's not referenced or used anywhere else.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe this is used here to establish unqualified access to the members of the YAML namespace, which are declared by yaml-cpp. This is very bad for readability, could we have one consistent namespace for this project (namespace substrait_cpp, maybe) then refer to members of other namespaces by their qualified names (YAML::Node, for example)?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

updated

Comment on lines 17 to 19
#include "memory"
#include "unordered_map"
#include "vector"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm used to seeing the <unordered_map> style used for standard includes.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

updated

Comment on lines +23 to +29
find_package(Protobuf REQUIRED)
include_directories(${PROTOBUF_INCLUDE_DIRS})
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not a cmake expert but why do we do find_package for protobuf but not for yaml-cpp?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

because we have to use 'protoc' command to compile substrait protocol files before we compile this project.

src/Function.h Outdated
Comment on lines 29 to 31
virtual const bool isWildcardType() const { return false; };

virtual const bool isValueArgument() const { return false; }
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure there is much value in returning a const bool

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

updated

src/Function.h Outdated
Comment on lines 36 to 43
struct EnumArgument : public FunctionArgument {
bool required;
bool const isRequired() const override { return required; }

const std::string toTypeString() const override {
return required ? "req" : "opt";
}
};
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is probably going to be out of date once substrait-io/substrait#342 merges. Though I'm not sure there is much that can be done about that. Have you taken a look at substrait-io/substrait#342 ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I noticed this PR, I will rebase the substrait submodule once it merged.

CMakeLists.txt Outdated
find_package(Protobuf REQUIRED)
include_directories(${PROTOBUF_INCLUDE_DIRS})

# GCC needs to link a library to enable std::filesystem.
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is std::filesystem needed for a substrait library?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

removed

CMakeLists.txt Outdated
if("${CMAKE_CXX_COMPILER_ID}" MATCHES "GNU")
set(FILESYSTEM "stdc++fs")

# Ensure we have gcc at least 8+.
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

removed

src/Type.h Outdated
#pragma once

#include "proto/substrait/algebra.pb.h"
#include "proto/substrait/type.pb.h"
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's not a good idea to expose protobuf headers in this library's headers, IMHO.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do you have any suggestions?

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The usual solution is to not expose any protobuf-generated types in your own headers. So define your own C++ classes and enums for any types that might have an equivalent in the protobuf definitions...

This is the solution we use in Parquet (with Thrift), Arrow (with flatbuffers), Arrow Flight (with protobuf).

(I'll add that even with that, protobuf still has issues in non-trivial scenarios: substrait-io/substrait#249 ; but it will have less of them)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good, I will refactor to use our own type enums instead of protobuf based type enums.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

updated


#include "Extension.h"

namespace YAML {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe this is used here to establish unqualified access to the members of the YAML namespace, which are declared by yaml-cpp. This is very bad for readability, could we have one consistent namespace for this project (namespace substrait_cpp, maybe) then refer to members of other namespaces by their qualified names (YAML::Node, for example)?


namespace YAML {

using namespace io::substrait;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
using namespace io::substrait;
using namespace io::substrait;

Similar to my other comment https://github.com/substrait-io/substrait-cpp/pull/1/files#r1004739159 I recommend against using namespace. This is broadly considered bad practice by several style guides https://isocpp.org/wiki/faq/coding-standards#using-namespace-std

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

updated

src/Type.h Outdated
Comment on lines 345 to 347
class UsedDefinedType : public TypeBase<TypeKind ::kUserDefined> {
public:
UsedDefinedType(const std::string &value) : value_(value) {}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
class UsedDefinedType : public TypeBase<TypeKind ::kUserDefined> {
public:
UsedDefinedType(const std::string &value) : value_(value) {}
class UserDefinedType : public TypeBase<TypeKind ::kUserDefined> {
public:
UserDefinedType(const std::string &value) : value_(value) {}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

updated

src/Type.h Outdated
Comment on lines 358 to 361
#define SCALAR_ACCESSOR(KIND) \
std::shared_ptr<const ScalarType<TypeKind::KIND>> KIND()

SCALAR_ACCESSOR(kBool);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This results in functions declarations like:

std::shared_ptr<const ScalarType<TypeKind::kBool>> kBool();

which will surprise users who expect the kThing naming convention to refer only to enum members. In this case I think we're not saving a lot of boilerplate by using a macro and it'd be better to write the declarations without using one:

Suggested change
#define SCALAR_ACCESSOR(KIND) \
std::shared_ptr<const ScalarType<TypeKind::KIND>> KIND()
SCALAR_ACCESSOR(kBool);
std::shared_ptr<const ScalarType<TypeKind::kBool>> bool();

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

updated

Comment on lines 30 to 32
virtual const FunctionMap scalaMapping() const {
static const FunctionMap scalaFunctionMap{};
return scalaFunctionMap;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think your intent here was to return a constant reference to a static instance, but without the & you'll be copying it instead. That's not a problem for the default mapping, but as mappings get larger it'll add significant overhead which is unnecessary unless you intend for FunctionMappings to be mutable after construction (in which case they'd need to construct a new FunctionMap on every call).

Suggested change
virtual const FunctionMap scalaMapping() const {
static const FunctionMap scalaFunctionMap{};
return scalaFunctionMap;
virtual const FunctionMap& scalarMapping() const {
static const FunctionMap scalarFunctionMap{};
return scalarFunctionMap;

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

updated

src/Function.h Outdated
std::optional<int> max;
};

struct FunctionVariant {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IIUC, this corresponds to an element of simple_extensions_schema.yaml::$defs.scalarFunction.impls. To reduce confusion, perhaps this could be renamed to FunctionImpl?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I saw substrait-java also use FunctionVariant

Comment on lines 145 to 147
const std::string absolute_path = __FILE__;
auto const pos = absolute_path.find_last_of('/');
return absolute_path.substr(0, pos) + "/../third_party/substrait/extensions/";
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think relying on a macro in this fashion is extremely fragile. The standard doesn't give guarantees on this macro's precise value https://eel.is/c++draft/cpp#predefined-1.4 (could be relative to the working directory of the compiler, could fail to include any directory information at all, on msvc the directory separators are \, ...).

Moreover, this approach assumes that the path represented by getSubstraitExtensionAbsolutePath() will always refer to a valid directory; if the library were built then the repository deleted, or if the library were copied to a different machine without the rest of the files in the repo then that will break this function.

As an alternative, what if we got this path from an environment variable? Or required it to be configured by users via getSubstraitExtensionAbsolutePath(std::string)?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

updated

src/Extension.h Outdated
class Extension {
public:
/// Deserialize default substrait extension.
static std::shared_ptr<Extension> load();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yaml-cpp will throw an exception in case the file cannot be found or parsing fails. Is it your intention to defer such an exception? If so, I'd say it's important to document which functions can throw and under which circumstances. In light of the fact that many c++ projects avoid usage of exceptions, you might prefer to catch yaml-cpp's exceptions and report errors through a different channel (for example, something analagous to std::expected).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got it ,do you have any suggestion for this exception handling?

@chaojun-zhang
Copy link
Contributor Author

@chaojun-zhang
Copy link
Contributor Author

I've put in a few specific comments but I think there are larger overall questions. I don't know if these all need to be answered now but it's probably worth discussion.

  • Is there a particular style guide you would like to follow? I'm used to Arrow-C++ which follows the Google C++ style guide but this code is quite divergent from that so do you have an alternative style guide you can point new contributors to?

uploaded clang-format in this PR.

  • Is your goal to create a library used by consumers? by producers? or both?

The long term would be both just like substrait-java does, and the next step is to create relation & expression CPP classes to represent PROTOBUF based types.

  • What operating systems and architectures are you planning on supporting directly? What sorts of targets do you hope will eventually be supported by the community as a whole?

I hope this tiny project should support mainstream OS like ubuntu, macos etc.

  • What kind of philosophy do you hope to take for exception handling. I see in some places that you are throwing exceptions. Some libraries will go out of their way to try and avoid exposing exceptions for easier compatibility. Do you have a strong opinion on this?

Please have a look at exception handling here

  • What sort of philosophy do you plan on taking for encapsulating the standard library? Right now you are including in Type.h for example which puts a rather large standard include in your public API. Do you want to eventually hide these things in implementation files as much as possible?

Sure, I have move public API into include folder

chaojun-zhang referenced this pull request in chaojun-zhang/substrait-cpp Feb 8, 2023
Init commit of substrait-cpp
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants