-
Notifications
You must be signed in to change notification settings - Fork 47
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Adding distributed scalar aggregates #570
Conversation
� Conflicts: � cpp/src/cylon/table.cpp
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There are minor changes that I saw, nothing serious. Thank you for the PR.
|
||
static std::shared_ptr<Column> Make(const std::string &id, const std::shared_ptr<DataType> &type, | ||
const std::shared_ptr<arrow::Array> &data_); | ||
static std::shared_ptr<Column> Make(std::shared_ptr<arrow::Array> data_); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why is data_
instead of data
@@ -52,7 +50,8 @@ class Result { | |||
* Function pointer for aggregate functions | |||
*/ | |||
typedef Status | |||
(*AggregateOperation)(const std::shared_ptr<cylon::Table> &table, int32_t col_idx, std::shared_ptr<Result> &output); | |||
(*AggregateOperation) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It will be better if we can create an alias and use it when the function pointer is used.
Another thought will it be possible to use std::function<...>
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm thinking of deprecating these all together. I think the better approach would be to use templates with a lambda function. I'll change to that.
@@ -72,7 +73,9 @@ cylon::Status Sum(const std::shared_ptr<cylon::Table> &table, int32_t col_idx, s | |||
* @param output | |||
* @return | |||
*/ | |||
cylon::Status Count(const std::shared_ptr<cylon::Table> &table, int32_t col_idx, std::shared_ptr<Result> &output); | |||
cylon::Status Count(const std::shared_ptr<cylon::Table> &table, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why are we using &output
is it possible to use *output
instead?
ALso, instead of returning cylon::Status, we can return a result object like Arrow is doing.
It could simplify things. Just a thought.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah I think we should convert all outputs to pointers. I created an issue here.
#486
|
||
net::ReduceOp reduce_op() const override { return net::SUM; } | ||
|
||
int32_t num_intermediate_results() const override { return 3; } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
any specific reason why a number is returned?
may be we can add a constexpr
and comment how this number is decided?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is a vitual method. I think virtual constexpr are only supported from c++20 on wards,
https://stackoverflow.com/questions/34828161/can-virtual-functions-be-constexpr
The idea of this method is to get the number of intermediate elements produced by the local aggregation. Ex: std produces, count, sum, and sum of squares.
@@ -69,6 +69,7 @@ class Table { | |||
*/ | |||
static Status FromColumns(const std::shared_ptr<CylonContext> &ctx, | |||
const std::vector<std::shared_ptr<Column>> &columns, | |||
const std::vector<std::string> &column_names, | |||
std::shared_ptr<Table> &tableOut); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What if we use std::shared_ptr<Table> *tableOut
instead?
from pycylon.data.aggregates cimport * | ||
from pycylon.common.status cimport CStatus | ||
from pycylon.api.lib cimport pycylon_unwrap_context, pycylon_unwrap_table, pycylon_wrap_column | ||
|
||
from pycylon.data.table import Table | ||
from pycylon.ctx.context import CylonContext | ||
from pycylon.data.column import Column |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: we should probably re-order the imports. But I am not sure if we checked this for other PRs.
@@ -36,3 +42,86 @@ AggregationOpString = { | |||
'quantile': CGroupByAggregationOp.CQUANTILE, | |||
'std': CGroupByAggregationOp.CSTDDEV, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit:
STD
instead std
QUANTILE
instead quantile
@@ -36,3 +42,86 @@ AggregationOpString = { | |||
'quantile': CGroupByAggregationOp.CQUANTILE, | |||
'std': CGroupByAggregationOp.CSTDDEV, | |||
} | |||
|
|||
def sum_table(ctx: CylonContext, table: Table, skipna=True) -> Column: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
skipna: bool = True
cdef shared_ptr[CCylonContext] cctx = pycylon_unwrap_context(ctx) | ||
cdef shared_ptr[CTable] ctable = pycylon_unwrap_table(table) | ||
cdef CBasicOptions options = CBasicOptions(skipna) | ||
cdef shared_ptr[CColumn] cresult |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: this seems like a separate cdef function which can be called within each of these aggregate functions.
WDYT?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
unless we use some sort of an enum, I'm not sure how we can achieve this.
|
||
cdef class Column: | ||
def __cinit__(self, array: pa.Array = None): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we could probably take the input as a list, a numpy array or an Arrow array. All 3 should be compatible and we can handle it inside the constructor.
No description provided.