[REVIEW] Python redesign for libcudf++ #3254

Merged
merged 223 commits into from
Jan 7, 2020
Changes from 212 commits
Commits
762efcd
Initial libcudf++ bindings
shwina Oct 18, 2019
7bec7b3
Add bindings for mutable_column_view
shwina Oct 21, 2019
d5d9bb7
Add initial bindings for table classes
shwina Oct 22, 2019
80f5ea5
Rework Column bindings
shwina Oct 24, 2019
868399c
Merge branch 'port-libcudf++-scatter-gather' into libcudfxx-pythonexa…
shwina Oct 24, 2019
ea27470
Initial gather libcudf++ bindings
shwina Oct 29, 2019
bc9c01f
Add Buffer.from_array_like()
shwina Oct 29, 2019
aebf743
Iterating on Python libcudf++ design
shwina Oct 29, 2019
6d83e5f
Merge branch 'branch-0.11' of https://github.com/rapidsai/cudf into l…
shwina Oct 29, 2019
6bb4aa8
Replace use of _Table and _Column with release_table and release_column
shwina Oct 29, 2019
bcfbd3d
Merge branch 'branch-0.11' of https://github.com/rapidsai/cudf into l…
shwina Oct 30, 2019
cf61707
Merge branch 'port-libcudf++-scatter-gather' into libcudfxx-pythonexa…
shwina Oct 30, 2019
c7165bf
Move release_*() utilities to methods
shwina Oct 30, 2019
1047cbe
Merge branch 'branch-0.11' of https://github.com/rapidsai/cudf into l…
shwina Oct 31, 2019
60c65fc
Merge branch 'branch-0.11' of https://github.com/rapidsai/cudf into l…
shwina Oct 31, 2019
e1d4cae
Merge branch 'port-libcudf++-scatter-gather' into libcudfxx-python
shwina Oct 31, 2019
ad62bce
Rudimentary null support in Python Column
shwina Nov 1, 2019
efa01b7
Merge branch 'branch-0.11' of https://github.com/rapidsai/cudf into l…
shwina Nov 4, 2019
369a297
Merge branch 'port-libcudf++-scatter-gather' into libcudfxx-python
shwina Nov 4, 2019
9c472d8
Starting on migrating Column/Buffer to libcudf++
shwina Nov 4, 2019
54925dd
progress
shwina Nov 4, 2019
066415d
Initial port of as_column
shwina Nov 4, 2019
2193034
Initial port of nans_to_nulls
shwina Nov 4, 2019
30e7940
Add method to translate gdf_column to Column and initial port of unar…
shwina Nov 4, 2019
947fe87
Update series constructor
shwina Nov 4, 2019
208ed89
Initial port of unaryops
shwina Nov 4, 2019
3d7e532
Merge branch 'branch-0.11' of https://github.com/rapidsai/cudf into l…
shwina Nov 5, 2019
36c4514
Merge branch 'port-libcudf++-scatter-gather' into libcudfxx-python
shwina Nov 5, 2019
1bcd8a6
Add Buffer.from_device_buffer
shwina Nov 5, 2019
16080d8
Allow creating numerical column from Pandas series
shwina Nov 5, 2019
890959a
Fixing calls to column constructors
shwina Nov 5, 2019
0a3f919
Just build a new column instead of call to `replace()`
shwina Nov 5, 2019
61d1bc2
Fix test
shwina Nov 5, 2019
c12ea49
Initial port of replace() to use new Column class
shwina Nov 5, 2019
3d8c12e
Migrate reduce to use new Column class. Pass all unaryops tests
shwina Nov 5, 2019
53fee64
Initial datetime support
shwina Nov 5, 2019
92e6e88
Enable creation of categorical columns
shwina Nov 5, 2019
f0e7dfc
Merge branch 'branch-0.11' of https://github.com/rapidsai/cudf into l…
shwina Nov 6, 2019
49a295b
Check for buffer size in column constructors
shwina Nov 6, 2019
8a140da
Update StringColumn constructor
shwina Nov 6, 2019
a365782
Pass Series concat tests
shwina Nov 6, 2019
efc27f2
Restore usage of old Cython gdf_column functions
shwina Nov 7, 2019
a4c722b
Pass more dataframe tests
shwina Nov 7, 2019
260e5c0
Pass more dataframe tests
shwina Nov 7, 2019
871fcd1
Fix scalar validity
shwina Nov 7, 2019
9e6b78f
Pass more dataframe tests
shwina Nov 7, 2019
e2cc50c
Pass more dataframe tests
shwina Nov 7, 2019
a9b6c3b
Fix datetime dtypes issue and pass tranpose tests
shwina Nov 8, 2019
5dd7430
More progress on libcudf++ transition
shwina Nov 10, 2019
034bec9
Pass all tests in test_dataframe.py
shwina Nov 11, 2019
da53321
Fix slices that give non-contiguous views
shwina Nov 11, 2019
03eaedd
Start work on passing indexing tests
shwina Nov 11, 2019
8eb0c7f
Pass indexing tests
shwina Nov 12, 2019
9500f91
Passing apply_* tests
shwina Nov 12, 2019
f6bbfe7
Fix missing name attribute
shwina Nov 12, 2019
fd45b34
Remove column.replace() and pass binops tests
shwina Nov 12, 2019
b4ccb09
Pass categorical tests
shwina Nov 12, 2019
b60cfc3
Fix categorical __contains__
shwina Nov 12, 2019
12cf52c
Fix column_view_from_string_column
shwina Nov 12, 2019
ff046c7
Fix returning ephemeral nvstrings object from StringColumn
shwina Nov 12, 2019
073cd9d
Fixes for __cuda_array_interface__
shwina Nov 12, 2019
36fa0ad
Fix dtype comparison
shwina Nov 12, 2019
05703b1
Fix copying test
shwina Nov 12, 2019
5b2f323
Pass datetime tests
shwina Nov 12, 2019
2bbc2ba
Fixes to dlpack
shwina Nov 12, 2019
b42e0af
Pass dropna tests
shwina Nov 12, 2019
35c1778
Make nvstrings a read-only property of StringColumn to prevent
shwina Nov 13, 2019
16afe11
Fix CategoricalDtype constructor in tests
shwina Nov 13, 2019
738e0e5
Fix creating empty categorical columns
shwina Nov 13, 2019
5916804
Fix import
shwina Nov 13, 2019
ea63b8e
Pass joining tests
shwina Nov 13, 2019
7128e65
Fix is_monotonic tests
shwina Nov 13, 2019
269863c
Fix column construction in tests
shwina Nov 13, 2019
fe9ec12
Fix string normalize binop
shwina Nov 13, 2019
8c5674a
Pass replace tests
shwina Nov 13, 2019
bc48253
Pass reshape tests
shwina Nov 13, 2019
c4d38ca
Fix rolling
shwina Nov 13, 2019
1af2fc2
Fix reveresed
shwina Nov 13, 2019
841c73a
Fixes for scatter_to_tables
shwina Nov 13, 2019
78ff797
Fix data/mask accessors
shwina Nov 13, 2019
9023dc9
Fix use of _mimic_inplace in __setitem__
shwina Nov 13, 2019
196aa8a
Passing string tests
shwina Nov 13, 2019
9bc4498
Modifications to build with external library support.
thomcom Nov 13, 2019
819213d
Merge pull request #2 from thomcom/libcudfxx-python-cuspatial
shwina Nov 13, 2019
ac71733
Fix gpu_view_as to accept an nbytes argument
shwina Nov 13, 2019
87be5a9
Merge branch 'libcudfxx-python' of git+ssh://github.com/shwina/cudf i…
shwina Nov 13, 2019
75e2192
Fixing pickling issues with new column
shwina Nov 13, 2019
440bcbf
set references to self and buf on DeviceNDArrays
trxcllnt Nov 13, 2019
fd5a372
Merge branch 'libcudfxx-python' of github.com:shwina/cudf into libcud…
trxcllnt Nov 13, 2019
171c493
Fix conversion of PyArrow array to categorical series
shwina Nov 14, 2019
9065174
Merge branch 'libcudfxx-python' of git+ssh://github.com/shwina/cudf i…
shwina Nov 14, 2019
79af25b
Add serialization/deserializationn for new column
shwina Nov 14, 2019
6ca9a14
Fix cudf::column -> Column construction
shwina Nov 14, 2019
14c419e
Merge branch 'branch-0.11' of https://github.com/rapidsai/cudf into l…
shwina Nov 14, 2019
9e340df
Change build_column API to include offset and children
shwina Nov 18, 2019
e8d4a7f
Add Table class and back Series by a Table
shwina Nov 19, 2019
358bbee
Back DataFrame by Column rather than Series - initial work
shwina Dec 3, 2019
e2c1cd6
Resolve issues with binops after Series->Column port
shwina Dec 4, 2019
6251122
Fixes to csv, dlpack and groupby to support DataFrame._cols
shwina Dec 4, 2019
1651ec1
Update indexing for new DataFrame._cols
shwina Dec 4, 2019
b3abf62
Fix inplace setitem issues and support for categoricals
shwina Dec 6, 2019
f87d599
Some improvements in handling indexing and initialization
shwina Dec 9, 2019
718b4f9
Add ugly _columns_name attribute to DataFrame
shwina Dec 9, 2019
1da9141
Undo change to multiindex take
shwina Dec 9, 2019
4d77b2c
Fix dataframe take
shwina Dec 9, 2019
8bbce77
Fix orc
shwina Dec 9, 2019
5d4152e
Address issue with name propagation in scatter_to_tables
shwina Dec 10, 2019
2f96cb2
Special-case empty transpose
shwina Dec 10, 2019
8d83f87
Fix multiindex to work with new DataFrame._cols
shwina Dec 10, 2019
c31a3b3
Fix to_pandas() to work with new DataFrame._cols
shwina Dec 10, 2019
fcde620
Fix sparse_df test
shwina Dec 10, 2019
8e337dc
Merge branch 'branch-0.12' of https://github.com/rapidsai/cudf into l…
shwina Dec 11, 2019
644dc9d
Update changelog
shwina Dec 11, 2019
663be42
Remove DataFrame unaops
shwina Dec 11, 2019
c5483bd
Derive DataFrame from Table
shwina Dec 11, 2019
8f3caaa
Rename row_tuple -> column_tuple
shwina Dec 12, 2019
9ae7563
Fix the order of children in StringColumn
shwina Dec 12, 2019
4a16246
Add offset arg to Python column constructors
shwina Dec 12, 2019
2ecc074
Fix order of children in string constructor in empty_like
shwina Dec 12, 2019
2cf2e21
Add children to Cython (mutable_)column_view
shwina Dec 12, 2019
8703fa1
Fix __setitem__ issue in DataFrame. Explicitly use deep= keyword arg
shwina Dec 13, 2019
499128f
Remove unused _add_empty_columns and _add_rows
shwina Dec 13, 2019
8345db6
Missing raise
shwina Dec 13, 2019
75cae22
Sync with changes to RMM DeviceBuffer
shwina Dec 14, 2019
fd2c142
Handle array_like and DeviceBuffer input directly in Buffer ctor
shwina Dec 14, 2019
07e212f
Refactoring CategoricalColumn to be composed of a child column
shwina Dec 16, 2019
85aa2cf
Refactor CategoricalColumn
shwina Dec 16, 2019
f89532d
Replace OrderedDict with OrderedColumnDict
shwina Dec 17, 2019
bcc9ef3
Fix Buffer docstring
shwina Dec 17, 2019
69c774a
Replace use of _assign with _mimic_inplace
shwina Dec 17, 2019
d18ae26
Fix unaryops mem leak
shwina Dec 17, 2019
9a36aca
Update python/cudf/cudf/_lib/copying.pyx
shwina Dec 17, 2019
e65a6c9
Update python/cudf/cudf/_lib/copying.pyx
shwina Dec 17, 2019
e01538d
Enable creation of Buffers from memoryviews
shwina Dec 18, 2019
10bd772
Fix warnings by settings _columns_name by default
shwina Dec 18, 2019
d801718
Make index manage its own name
shwina Dec 18, 2019
ae1c76b
Actually fix memory leak
shwina Dec 18, 2019
a21d3df
Merge branch 'libcudfxx-python' of git+ssh://github.com/shwina/cudf i…
shwina Dec 18, 2019
e698477
Introduce and use build_categorical_column
shwina Dec 18, 2019
4067470
Fix dask_cudf failures after refactor
shwina Dec 18, 2019
5cac8e4
Remove Column.name
shwina Dec 19, 2019
6f47a5e
Cache null_count
shwina Dec 19, 2019
dcc7025
Add rmm as a build dependency
shwina Dec 19, 2019
db7a37e
Add except? to py_to_c_str()
shwina Dec 19, 2019
b2ef181
Update python/cudf/cudf/core/column/column.py
shwina Dec 19, 2019
4482613
Use _DevicePointer to construct a Buffer from raw pointer
shwina Dec 20, 2019
c5a8cd9
Remove tests that operate on individual DataFrame columns in-place
shwina Dec 20, 2019
39966fd
Fix missing parens
shwina Dec 20, 2019
439bbf9
Remove TIMESTAMP_DAYS from cudf dtypes
shwina Dec 20, 2019
8725aa6
Better handling of categorical dtype in (mutable)_view()
shwina Dec 20, 2019
0b531e0
Fix handling of D/W/M/Y datetime types
shwina Dec 20, 2019
83f4265
cimport move from RMM
shwina Dec 20, 2019
52d5f28
Clarify Buffer docstring
shwina Dec 20, 2019
a6b1891
Merge branch 'libcudfxx-python' of git+ssh://github.com/shwina/cudf i…
shwina Dec 20, 2019
c71b2ca
Replace use of np.array() with np.asarray()
shwina Dec 20, 2019
6f24846
Add length= parameter to as_column factory
shwina Dec 23, 2019
08e8290
Minor fixes based on feedback
shwina Dec 30, 2019
658f211
Improve CategoricalDtype.serialize()
shwina Dec 30, 2019
f669bfb
Fix CategoricalDtype.serialize/deserialize
shwina Dec 30, 2019
3f261a6
Add Column._set_mask and specialize for StringColumn
shwina Dec 30, 2019
8a5dca6
Use column_empty v/s creating empty rmm.device_array
shwina Dec 30, 2019
f0c64bd
Use path to cuda-gdb to locate CUDA_HOME instead of nvcc
shwina Dec 31, 2019
f646b7a
Change np.array->np.asarray
shwina Dec 31, 2019
8ced3a4
Add has_nulls and nullable properties to Column
shwina Jan 2, 2020
6a08b61
Add more usage of has_nulls property
shwina Jan 2, 2020
d85e671
Restore assertions in test
shwina Jan 2, 2020
e4a5e0c
Move nogil up in join.pyx
shwina Jan 2, 2020
a496541
Remove check for float in unary ops
shwina Jan 2, 2020
f158d5e
Document cached_property
shwina Jan 2, 2020
1c173fa
Check args in Column constructor
shwina Jan 2, 2020
a3e955b
Rely on RMM declaration of move(device_buffer)
shwina Jan 2, 2020
797eaba
Check data/mask type in setters
shwina Jan 2, 2020
2bf2a42
Remove stale TODOs
shwina Jan 3, 2020
cdfe118
Add brief docs for Column
shwina Jan 3, 2020
b28af05
Rename _data_view() and _mask_view()
shwina Jan 3, 2020
6285ddf
Use column_empty in place of device_array creation
shwina Jan 3, 2020
73877a0
Fix .nullmask property
shwina Jan 3, 2020
b4d1bdf
Missing import
shwina Jan 3, 2020
f8bf11e
Fix doc
shwina Jan 3, 2020
3c49df8
Use CategoricalColumn builder
shwina Jan 3, 2020
44d4c27
Get data/mask pointers directly in __cuda_array_interface__
shwina Jan 3, 2020
7ee419d
Add docstring for build_categorical_column
shwina Jan 3, 2020
4b14491
Replace to_gpu_array() with data_array_view
shwina Jan 3, 2020
2ed838b
Return column instead of device_array in column_applymap
shwina Jan 3, 2020
a5e9c99
Improve handling of __cuda_array_interface__ in as_column
shwina Jan 3, 2020
c966502
Typo
shwina Jan 3, 2020
8fbe227
Get data/mask ptrs directly
shwina Jan 3, 2020
00cfd64
Fix empty column construction
shwina Jan 3, 2020
134a19f
Fixes to numerical.py after review
shwina Jan 3, 2020
fe01684
Fixed column hashing
shwina Jan 3, 2020
0075d20
Fix numeric->string typecast
shwina Jan 3, 2020
aa62f93
Rename mask to mask_ptr in StringColumn.nvstrings
shwina Jan 3, 2020
e739aa4
Explicitly construct a Pandas index from dict keys
shwina Jan 3, 2020
2c859ff
Use empty column in df.describe()
shwina Jan 3, 2020
d568603
Add default masked=False to column_empty
shwina Jan 3, 2020
408818b
Fix mask generation in StringColumn.as_numeric
shwina Jan 3, 2020
a60e82a
Use shutil.which to find cuda-gdb
shwina Jan 3, 2020
290c47d
Default to object dtype in column_empty
shwina Jan 3, 2020
9646b73
Unskip test
shwina Jan 3, 2020
2e0c1ff
Fix empty Series construction in indexing.py
shwina Jan 3, 2020
9dc0029
Try adding default constructors for table_view and mutable_table_view
shwina Jan 3, 2020
933cdc3
Merge branch 'branch-0.12' of https://github.com/rapidsai/cudf into l…
shwina Jan 3, 2020
37257eb
Trying a fix for transpose
shwina Jan 3, 2020
147d7d2
Fixing style issues
shwina Jan 3, 2020
8fa0fb7
More style
shwina Jan 3, 2020
49549ef
Skip index=True in dask test_memory_usage
shwina Jan 6, 2020
1fb91ec
Remove extraneous comma
shwina Jan 6, 2020
55e85a2
Fix scalar_broadcast_to to work with an integer size instead of a shape
shwina Jan 6, 2020
53cb962
Convert columns to pd.Index in transpose
shwina Jan 6, 2020
defae49
Restore RangeIndex after an Iloc
shwina Jan 6, 2020
e6910e3
Undo modifications to dask memory_usage test
shwina Jan 6, 2020
b4d2f43
Fix changelog
shwina Jan 6, 2020
59b0d36
Fix bug in iloc returning a RangeIndex
shwina Jan 6, 2020
9ff9261
Remove unused import
shwina Jan 6, 2020
74f4527
Merge branch 'branch-0.12' of https://github.com/rapidsai/cudf into l…
shwina Jan 7, 2020
151475a
Remove print statements
shwina Jan 7, 2020
ba37d79
Update python/cudf/cudf/_lib/cudf.pyx
shwina Jan 7, 2020
3514463
Update python/cudf/cudf/core/column/categorical.py
shwina Jan 7, 2020
917154b
Better handling of CUDA_HOME in setup.py
shwina Jan 7, 2020
6fbf958
Remove unused variable
shwina Jan 7, 2020
3808a31
Explicitly check if a valid is not NULL
shwina Jan 7, 2020
20b1781
Merge branch 'libcudfxx-python' of git+ssh://github.com/shwina/cudf i…
shwina Jan 7, 2020
201e9eb
Update python/cudf/cudf/core/column/column.py
shwina Jan 7, 2020
3 changes: 3 additions & 0 deletions .gitignore
@@ -23,6 +23,9 @@ python/*/build
 python/cudf/*/_lib/**/*.cpp
 python/cudf/*/_lib/**/*.h
 python/cudf/*/_lib/.nfs*
+python/cudf/*/_libxx/**/*.cpp
+python/cudf/*/_libxx/**/*.h
+python/cudf/*/_libxx/.nfs*
 python/cudf/*.ipynb
 python/cudf/.ipynb_checkpoints
 python/*/record.txt
1 change: 1 addition & 0 deletions CHANGELOG.md
@@ -4,6 +4,7 @@

 - PR #3224 Define and implement new join APIs.
 - PR #3284 Add gpu-accelerated parquet writer
+- PR #3254 Python redesign for libcudf++
 - PR #3336 Add `from_dlpack` and `to_dlpack`
 - PR #3555 Add column names support to libcudf++ io readers and writers
 - PR #3610 Add memory_usage to DataFrame and Series APIs
1 change: 1 addition & 0 deletions conda/recipes/cudf/meta.yaml
@@ -27,6 +27,7 @@ requirements:
     - dlpack
     - pyarrow 0.15.0.*
     - libcudf {{ version }}
+    - rmm {{ minor_version }}.*
   run:
     - python
     - pandas>=0.24.2,<0.25
8 changes: 6 additions & 2 deletions cpp/include/cudf/table/table_view.hpp
@@ -108,7 +108,7 @@ class table_view_base {
    *
    * @throws std::out_of_range
    * If `column_index` is out of the range [0, num_columns)
-   *
+   *
    * @param column_index The index of the desired column
    * @return A reference to the desired column
    *---------------------------------------------------------------------------**/
@@ -124,7 +124,7 @@ class table_view_base {
    *---------------------------------------------------------------------------**/
   size_type num_rows() const noexcept { return _num_rows; }

-  table_view_base() = delete;
+  table_view_base() = default;

   ~table_view_base() = default;

@@ -148,6 +148,8 @@ class table_view : public detail::table_view_base<column_view> {
  public:
   using ColumnView = column_view;

+  table_view() = default;
+
   /**---------------------------------------------------------------------------*
    * @brief Construct a table from a vector of table views
    *
@@ -193,6 +195,8 @@ class mutable_table_view : public detail::table_view_base<mutable_column_view> {
  public:
   using ColumnView = mutable_column_view;

+  mutable_table_view() = default;
+
   mutable_column_view& column(size_type column_index) const {
     return const_cast<mutable_column_view&>(table_view_base::column(column_index));
   }
4 changes: 3 additions & 1 deletion python/cudf/cudf/_lib/concat.pyx
@@ -7,17 +7,19 @@

 from cudf._lib.cudf cimport *
 from cudf._lib.cudf import *
+from cudf._libxx.column cimport Column
 from libc.stdlib cimport free
 from libcpp.vector cimport vector

 from cudf._lib.includes.concat cimport gdf_column_concat


-def _column_concat(cols_to_concat, output_col):
+def _column_concat(cols_to_concat, Column output_col):
     cdef gdf_column* c_output_col = column_view_from_column(output_col)
     cdef vector[gdf_column*] c_input_cols
     cdef int num_cols = len(cols_to_concat)

+    cdef Column col
     for col in cols_to_concat:
         c_input_cols.push_back(column_view_from_column(col))

100 changes: 57 additions & 43 deletions python/cudf/cudf/_lib/copying.pyx
@@ -5,12 +5,13 @@
 # cython: embedsignature = True
 # cython: language_level = 3

+import cudf
 from cudf.core.buffer import Buffer
 from cudf._lib.cudf cimport *
 from cudf._lib.cudf import *

 import cudf.utils.utils as utils
-from cudf.utils.dtypes import is_string_dtype
+from cudf.utils.dtypes import is_string_dtype, is_categorical_dtype
 from cudf._lib.utils cimport (
     columns_from_table,
     table_from_columns,
@@ -48,7 +49,7 @@ def clone_columns_with_size(in_cols, row_size):
     for col in in_cols:
         o_col = column.column_empty_like(col,
                                          dtype=col.dtype,
-                                         masked=col.has_null_mask,
+                                         masked=col.mask,
                                          newsize=row_size)
         out_cols.append(o_col)

@@ -60,7 +61,7 @@ def _normalize_maps(maps, size):

     maps = column.as_column(maps).astype("int32")
     maps = maps.binary_operator("mod", np.int32(size))
-    maps = maps.data.mem
+    maps = maps.data_array_view
     return maps


@@ -88,10 +89,7 @@ def gather(source, maps, bounds_check=True):
     for i, in_col in enumerate(in_cols):
         in_cols[i] = column.as_column(in_cols[i])

-    if is_string_dtype(in_cols[0]):
-        in_size = in_cols[0].data.size()
-    else:
-        in_size = in_cols[0].data.size
+    in_size = in_cols[0].size

     maps = column.as_column(maps)

@@ -110,11 +108,12 @@ def gather(source, maps, bounds_check=True):

     for i, in_col in enumerate(in_cols):
         if isinstance(in_col, CategoricalColumn):
-            out_cols[i] = CategoricalColumn(
-                data=out_cols[i].data,
-                mask=out_cols[i].mask,
+            out_cols[i] = column.build_categorical_column(
                 categories=in_col.cat().categories,
-                ordered=in_col.cat().ordered)
+                codes=out_cols[i],
+                mask=out_cols[i].mask,
+                ordered=in_col.cat().ordered
+            )

     free_column(c_maps)
     free_table(c_in_table)
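
Editorial note: the rebuilt column in this hunk passes the gathered column as `codes` and reuses the input's `categories`, which matches the usual picture of a categorical as integer codes plus a list of category values. A minimal sketch of that picture in plain Python, with lists standing in for device columns. `gather_codes` and `decode` are hypothetical helpers for illustration, not cudf APIs:

```python
def gather_codes(codes, maps):
    """Pick codes[i] for each i in maps (lists stand in for device columns)."""
    return [codes[i] for i in maps]


def decode(codes, categories):
    """Translate integer codes back into category values."""
    return [categories[c] for c in codes]


categories = ["a", "b", "c"]
codes = [0, 2, 2, 1]  # encodes ["a", "c", "c", "b"]

gathered = gather_codes(codes, [3, 0, 0])
print(gathered)                      # [1, 0, 0]
print(decode(gathered, categories))  # ['b', 'a', 'a']
```

Only the codes are moved by the gather; the categories are untouched, so they can be reattached afterwards without copying.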
@@ -163,6 +162,15 @@ def scatter(source, maps, target, bounds_check=True):

     result_cols = columns_from_table(&c_result_table)

+    for i, in_col in enumerate(target_cols):
+        if is_categorical_dtype(in_col.dtype):
+            result_cols[i] = column.build_categorical_column(
+                categories=in_col.cat().categories,
+                codes=result_cols[i],
+                mask=result_cols[i].mask,
+                ordered=in_col.cat().ordered
+            )
+
     del c_source_table
     del c_target_table

@@ -189,7 +197,8 @@ def copy_column(input_col):

 def copy_range(out_col, in_col, int out_begin, int out_end,
                int in_begin):
-    from cudf.core.column import Column
+
+    from cudf.core.column import as_column

     if abs(out_end - out_begin) <= 1:
         return out_col
@@ -202,17 +211,15 @@ def copy_range(out_col, in_col, int out_begin, int out_end,
     if out_begin > out_end:
         return out_col

-    if out_col.null_count == 0 and in_col.has_null_mask:
+    if not out_col.has_nulls and in_col.nullable:
         mask = utils.make_mask(len(out_col))
         cudautils.fill_value(mask, 0xff)
-        out_col._mask = Buffer(mask)
-        out_col._null_count = 0
+        out_col.mask = Buffer(mask)

-    if in_col.null_count == 0 and out_col.has_null_mask:
+    if not in_col.has_nulls and out_col.nullable:
         mask = utils.make_mask(len(in_col))
         cudautils.fill_value(mask, 0xff)
-        in_col._mask = Buffer(mask)
-        in_col._null_count = 0
+        in_col.mask = Buffer(mask)

     cdef gdf_column* c_out_col = column_view_from_column(out_col)
     cdef gdf_column* c_in_col = column_view_from_column(in_col)
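
Editorial note: the two conditions in this hunk replace `null_count == 0` with `not has_nulls` and `has_null_mask` with `nullable`, so "nullable" reads as "carries a mask" and "has nulls" as "at least one mask bit is unset". A rough sketch of that distinction, assuming an LSB-first bitmask with one bit per row and ignoring any allocation padding. `ColumnSketch`, `all_valid_mask` and `align_masks` are hypothetical names for illustration, not the classes or functions in this PR:

```python
class ColumnSketch:
    """Hypothetical stand-in for a column: a row count plus an optional bitmask."""

    def __init__(self, size, mask=None):
        self.size = size
        self.mask = mask  # bytearray, one validity bit per row, or None

    @property
    def nullable(self):
        # A column is nullable as soon as it carries a mask.
        return self.mask is not None

    @property
    def null_count(self):
        if self.mask is None:
            return 0
        valid = sum((self.mask[i // 8] >> (i % 8)) & 1 for i in range(self.size))
        return self.size - valid

    @property
    def has_nulls(self):
        return self.null_count != 0


def all_valid_mask(size):
    """Bitmask with every bit set (0xff bytes), i.e. no nulls."""
    return bytearray([0xFF]) * ((size + 7) // 8)


def align_masks(out_col, in_col):
    """Mirror of the two mask checks in the hunk above (assumed reading)."""
    if not out_col.has_nulls and in_col.nullable:
        out_col.mask = all_valid_mask(out_col.size)
    if not in_col.has_nulls and out_col.nullable:
        in_col.mask = all_valid_mask(in_col.size)


out_col = ColumnSketch(10)                                      # no mask at all
in_col = ColumnSketch(10, mask=bytearray([0b11111011, 0xFF]))   # row 2 is null
align_masks(out_col, in_col)
print(out_col.nullable, out_col.has_nulls)  # True False
print(in_col.null_count)                    # 1
```

After alignment both columns carry a mask, so a range copy can transfer validity bits along with the data.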
@@ -224,20 +231,23 @@ def copy_range(out_col, in_col, int out_begin, int out_end,
                    out_end,
                    in_begin)

-    out_col._update_null_count(c_out_col.null_count)
-
     if is_string_dtype(out_col) and len(out_col) > 0:
-        update_nvstrings_col(
-            out_col,
-            <uintptr_t>c_out_col.dtype_info.category)
+        nvcat_ptr = int(<uintptr_t>c_out_col.dtype_info.category)
+        nvcat_obj = None
+        if nvcat_ptr:
+            nvcat_obj = nvcategory.bind_cpointer(nvcat_ptr)
+            nvstr_obj = nvcat_obj.to_strings()
+        else:
+            nvstr_obj = nvstrings.to_device([])
+        out_col = as_column(nvstr_obj)

     free_column(c_in_col)
     free_column(c_out_col)

     return out_col


-def scatter_to_frames(source, maps, index=None):
+def scatter_to_frames(source, maps, index=None, names=None, index_names=None):
     """
     Scatters rows to 'n' dataframes according to maps

@@ -251,39 +261,37 @@ def scatter_to_frames(source, maps, index=None):
     -------
     list of scattered dataframes
     """
-    from cudf.core.column import column, CategoricalColumn
+    from cudf.core.column import column, build_column, build_categorical_column
     from cudf.core.series import Series

     in_cols = source

     if index:
-        ind_names = [ind.name for ind in index]
-        ind_names_tmp = [(ind_name or "_tmp_index") for ind_name in ind_names]
+        ind_names_tmp = [(ind_name or "_tmp_index")
+                         for ind_name in index_names]
         for i in range(len(index)):
-            index[i].name = ind_names_tmp[i]
             in_cols.append(index[i])
+            names.append(ind_names_tmp[i])

     col_count=len(in_cols)
     if col_count == 0:
         return []

     cats = {}
     for i, in_col in enumerate(in_cols):
         in_cols[i] = column.as_column(in_cols[i])
-        if isinstance(in_cols[i], CategoricalColumn):
-            cats[in_cols[i].name] = (
-                Series(in_cols[i]._categories),
-                in_cols[i]._ordered
+        if is_categorical_dtype(in_cols[i]):
+            cats[names[i]] = (
+                Series(in_cols[i].categories),
+                in_cols[i].ordered
             )

-    if is_string_dtype(in_cols[0]):
-        in_size = in_cols[0].data.size()
-    else:
-        in_size = in_cols[0].data.size
+    in_size = in_cols[0].size

     maps = column.as_column(maps).astype("int32")
     gather_count = len(maps)
     assert(gather_count == in_size)

-    cdef gdf_column** c_in_cols = cols_view_from_cols(in_cols)
+    cdef gdf_column** c_in_cols = cols_view_from_cols(in_cols, names)
     cdef cudf_table* c_in_table = new cudf_table(c_in_cols, col_count)
     cdef gdf_column* c_maps = column_view_from_column(maps)
     cdef vector[cudf_table] c_out_tables
@@ -295,18 +303,24 @@ def scatter_to_frames(source, maps, index=None):
     for tab in c_out_tables:
         df = table_to_dataframe(&tab, int_col_names=False)
         for name, cat_info in cats.items():
+            if is_categorical_dtype(df[name].dtype):
+                data_dtype = df[name].codes.dtype
+            else:
+                data_dtype = df[name].dtype
             df[name] = Series(
-                CategoricalColumn(
-                    data=df[name].data,
+                build_categorical_column(
                     categories=cat_info[0],
-                    ordered=cat_info[1],
+                    codes=df[name]._column,
+                    ordered=cat_info[1]
                 )
             )

         if index:
+            print(index)
+            print(index_names)
+            print(df)
             df = df.set_index(ind_names_tmp)
             if len(index) == 1:
-                df.index.name = ind_names[0]
+                df.index.name = index_names[0]
         out_tables.append(df)

     free_table(c_in_table, c_in_cols)
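
Editorial note: `scatter_to_frames` gives unnamed index levels the temporary name `"_tmp_index"` so they can travel through the scatter as ordinary columns, then restores the original name when there is a single index. A small sketch of the naming step only, in plain Python. `with_placeholders` is a hypothetical helper, not part of the PR:

```python
def with_placeholders(index_names, placeholder="_tmp_index"):
    """Replace falsy names (None or "") with a placeholder, keeping order."""
    return [name or placeholder for name in index_names]


print(with_placeholders([None]))        # ['_tmp_index']
print(with_placeholders(["id", None]))  # ['id', '_tmp_index']
print(with_placeholders([None, None]))  # ['_tmp_index', '_tmp_index']
```

The last call shows a property of the `name or placeholder` idiom worth keeping in mind: two unnamed levels receive the same temporary name. Whether that case can arise here is not something the diff shows.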
3 changes: 1 addition & 2 deletions python/cudf/cudf/_lib/csv.pyx
@@ -298,7 +298,7 @@ cpdef write_csv(
             if col_name not in cols:
                 raise NameError('column {!r} does not exist in DataFrame'
                                 .format(col_name))
-            col = cols[col_name]._column
+            col = cols[col_name]
             check_gdf_compatibility(col)
             # Workaround for string columns
             if col.dtype.type == np.object_:
@@ -308,7 +308,6 @@ cpdef write_csv(
             list_cols.push_back(c_col)
     else:
         for idx, (col_name, col) in enumerate(cols.items()):
-            col = col._column
             check_gdf_compatibility(col)
             # Workaround for string columns
             if col.dtype.type == np.object_:
24 changes: 10 additions & 14 deletions python/cudf/cudf/_lib/cudf.pxd
@@ -17,6 +17,8 @@ from libc.stdint cimport ( # noqa: E211
 )
 from libcpp.vector cimport vector

+from cudf._libxx.column cimport Column
+
 # Utility functions to build gdf_columns, gdf_context and error handling

 cpdef get_ctype_ptr(obj)
@@ -31,20 +33,12 @@ cdef np_dtype_from_gdf_column(gdf_column* col)

 cdef get_scalar_value(gdf_scalar scalar, dtype)

-cdef gdf_column* column_view_from_column(col, col_name=*) except? NULL
-cdef gdf_column* column_view_from_NDArrays(
-    size,
-    data,
-    mask,
-    dtype,
-    null_count
-) except? NULL
+cdef gdf_column* column_view_from_column(Column col, col_name=*) except? NULL
 cdef gdf_scalar* gdf_scalar_from_scalar(val, dtype=*) except? NULL
-cdef gdf_column_to_column(gdf_column* c_col, int_col_name=*)
-cdef gdf_column_to_column_mem(gdf_column* input_col)
-cdef update_nvstrings_col(col, uintptr_t category_ptr)
-cdef gdf_column* column_view_from_string_column(col, col_name=*) except? NULL
-cdef gdf_column** cols_view_from_cols(cols) except ? NULL
+cdef Column gdf_column_to_column(gdf_column* c_col)
+cdef gdf_column* column_view_from_string_column(Column col,
+                                                col_name=*) except? NULL
+cdef gdf_column** cols_view_from_cols(cols, names=*) except ? NULL
 cdef free_table(cudf_table* table0, gdf_column** cols=*)
 cdef free_column(gdf_column* c_col)

@@ -382,4 +376,6 @@ cdef extern from "cudf/legacy/table.hpp" namespace "cudf" nogil:
         # gdf_column const* const* end() const
         # gdf_column const* get_column(size_type index) const except +

-cpdef gdf_dtype gdf_dtype_from_value(col, dtype=*) except? GDF_invalid
+cdef gdf_dtype gdf_dtype_from_dtype(dtype) except? GDF_invalid
+
+cdef char* py_to_c_str(object py_str) except? NULL