Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Adding distributed scalar aggregates #570

Merged
merged 17 commits into from
Mar 4, 2022

Conversation

nirandaperera
Copy link
Collaborator

No description provided.

@nirandaperera nirandaperera marked this pull request as ready for review February 20, 2022 17:03
@nirandaperera nirandaperera requested a review from vibhatha March 1, 2022 16:59
Copy link
Collaborator

@vibhatha vibhatha left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are minor changes that I saw, nothing serious. Thank you for the PR.


static std::shared_ptr<Column> Make(const std::string &id, const std::shared_ptr<DataType> &type,
const std::shared_ptr<arrow::Array> &data_);
static std::shared_ptr<Column> Make(std::shared_ptr<arrow::Array> data_);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why is data_ instead of data

@@ -52,7 +50,8 @@ class Result {
* Function pointer for aggregate functions
*/
typedef Status
(*AggregateOperation)(const std::shared_ptr<cylon::Table> &table, int32_t col_idx, std::shared_ptr<Result> &output);
(*AggregateOperation)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It will be better if we can create an alias and use it when the function pointer is used.
Another thought will it be possible to use std::function<...>?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm thinking of deprecating these all together. I think the better approach would be to use templates with a lambda function. I'll change to that.

@@ -72,7 +73,9 @@ cylon::Status Sum(const std::shared_ptr<cylon::Table> &table, int32_t col_idx, s
* @param output
* @return
*/
cylon::Status Count(const std::shared_ptr<cylon::Table> &table, int32_t col_idx, std::shared_ptr<Result> &output);
cylon::Status Count(const std::shared_ptr<cylon::Table> &table,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why are we using &output is it possible to use *output instead?
ALso, instead of returning cylon::Status, we can return a result object like Arrow is doing.
It could simplify things. Just a thought.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah I think we should convert all outputs to pointers. I created an issue here.
#486


net::ReduceOp reduce_op() const override { return net::SUM; }

int32_t num_intermediate_results() const override { return 3; }
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

any specific reason why a number is returned?
may be we can add a constexpr and comment how this number is decided?

Copy link
Collaborator Author

@nirandaperera nirandaperera Mar 3, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is a vitual method. I think virtual constexpr are only supported from c++20 on wards,
https://stackoverflow.com/questions/34828161/can-virtual-functions-be-constexpr

The idea of this method is to get the number of intermediate elements produced by the local aggregation. Ex: std produces, count, sum, and sum of squares.

@@ -69,6 +69,7 @@ class Table {
*/
static Status FromColumns(const std::shared_ptr<CylonContext> &ctx,
const std::vector<std::shared_ptr<Column>> &columns,
const std::vector<std::string> &column_names,
std::shared_ptr<Table> &tableOut);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What if we use std::shared_ptr<Table> *tableOut instead?

Comment on lines +15 to +21
from pycylon.data.aggregates cimport *
from pycylon.common.status cimport CStatus
from pycylon.api.lib cimport pycylon_unwrap_context, pycylon_unwrap_table, pycylon_wrap_column

from pycylon.data.table import Table
from pycylon.ctx.context import CylonContext
from pycylon.data.column import Column
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: we should probably re-order the imports. But I am not sure if we checked this for other PRs.

@@ -36,3 +42,86 @@ AggregationOpString = {
'quantile': CGroupByAggregationOp.CQUANTILE,
'std': CGroupByAggregationOp.CSTDDEV,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit:
STD instead std
QUANTILE instead quantile

@@ -36,3 +42,86 @@ AggregationOpString = {
'quantile': CGroupByAggregationOp.CQUANTILE,
'std': CGroupByAggregationOp.CSTDDEV,
}

def sum_table(ctx: CylonContext, table: Table, skipna=True) -> Column:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

skipna: bool = True

Comment on lines +59 to +62
cdef shared_ptr[CCylonContext] cctx = pycylon_unwrap_context(ctx)
cdef shared_ptr[CTable] ctable = pycylon_unwrap_table(table)
cdef CBasicOptions options = CBasicOptions(skipna)
cdef shared_ptr[CColumn] cresult
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: this seems like a separate cdef function which can be called within each of these aggregate functions.
WDYT?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

unless we use some sort of an enum, I'm not sure how we can achieve this.


cdef class Column:
def __cinit__(self, array: pa.Array = None):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we could probably take the input as a list, a numpy array or an Arrow array. All 3 should be compatible and we can handle it inside the constructor.

@nirandaperera nirandaperera merged commit b2c0820 into cylondata:main Mar 4, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants