Add a numpy engine for reading using numpy.genfromtxt() #452
Conversation
This commit adds the normal data section reader back into reader.py and also adds some keyword arguments to LASFile.read. The behaviour is unchanged if engine == "normal" (the default). If engine == "pandas", then:

- If the file is wrapped:
  - if pandas_engine_wrapped_error is True (the default), a LASDataError exception is raised;
  - if False, a logger.warning message is emitted.
- The data section is then read using reader.py:read_data_section_iterative_pandas_engine(). If an exception is raised in that function:
  - if pandas_engine_error == "retry" (the default), the data section will be re-read by the normal parser;
  - otherwise, if it is "error", the exception will be raised.

One problem is that pd.read_csv doesn't always raise an exception as we'd perhaps like it to. A sketch of the dispatch is below.
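For readers skimming the thread, here is a minimal sketch of that dispatch. This is not the actual code in reader.py; LASDataError and the two reader functions are stand-ins for the real ones.

```python
import logging

logger = logging.getLogger(__name__)

class LASDataError(Exception):
    """Stand-in for lasio's exception class."""

def read_normal(file_obj):
    ...  # placeholder for the normal line-by-line reader

def read_pandas(file_obj):
    ...  # placeholder for read_data_section_iterative_pandas_engine()

def read_data_section(file_obj, wrapped, engine="normal",
                      pandas_engine_wrapped_error=True,
                      pandas_engine_error="retry"):
    if engine == "pandas":
        if wrapped:
            if pandas_engine_wrapped_error:
                raise LASDataError("pandas engine cannot read wrapped files")
            logger.warning("reading a wrapped file with the pandas engine")
        try:
            return read_pandas(file_obj)
        except Exception:
            if pandas_engine_error == "retry":
                return read_normal(file_obj)  # fall back to the normal parser
            raise
    return read_normal(file_obj)
```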
This checkin is a quick hack to get an initial view of using numpy.genfromtxt() for importing data sections. It is based on the pandas-readcsv branch content and makes the following changes:

- Set 'pandas' as the default engine, so that we can run all the current tests with 'pandas' (actually numpy.genfromtxt()) and get an initial view of any test failures.
- Replace the actual pandas.read_csv(...) call with numpy.genfromtxt() (sketched below).
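As a rough illustration of that swap (the data section below is made up, and this is not the branch's exact call):

```python
import numpy as np
from io import StringIO

# An illustrative whitespace-delimited ~ASCII data section.
data_section = StringIO(
    "1670.000  123.450  2550.0\n"
    "1669.875  123.460  2551.5\n"
    "1669.750  -999.25  2552.0\n"
)

# numpy.genfromtxt parses whitespace-delimited numeric text directly,
# filling the role pandas.read_csv played on the pandas-readcsv branch.
arr = np.genfromtxt(data_section)
print(arr.shape)  # (3, 3)
```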
Reverting to the normal engine for wrapped files brings it down to 18 test failures.
- If the numpy engine throws an exception on data read, retry with the normal engine.
- Remove '_iterative' from the names of the data-read engine functions.
Remaining test failures:

FAILED tests/test_null_policy.py::test_null_policy_9999_aggressive - AssertionError: assert False
Make these temporary changes to enable integrating the numpy engine:

- Route the "aggressive" and "all" null_policies to the normal engine.
- Set tests that fail for the numpy engine to XFAIL. These tests will continue to pass for the normal engine.
- First draft of using genfromtxt's usemask and missing_values to align functionality with the normal engine (sketched after this list). This needs follow-up work.
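A minimal sketch of the genfromtxt keywords mentioned in the last bullet; the null value and data here are illustrative, not lasio's configuration:

```python
import numpy as np
from io import StringIO

section = StringIO(
    "1670.0  123.45\n"
    "1669.9  -999.25\n"
)

# missing_values treats matching tokens as missing, and usemask=True
# returns a numpy masked array instead of substituting a fill value --
# one way to mirror the normal engine's NULL handling.
arr = np.genfromtxt(section, missing_values="-999.25", usemask=True)
print(arr)  # the -999.25 entry prints as "--" (masked)
```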
Force-pushed from df91e29 to 528bd81.
@Boorhin and I tried to resolve the remaining test failures for the numpy engine, but they are going to take more time. So the latest checkin on dcslagel:numpy-genfromtxt-explore consists of temporary workarounds (routing some null_policy keys to the normal engine) and marking the failing tests with pytest.mark.xfail(). Xfail enables these tests to run and pass for the normal engine but be ignored when tested with the numpy engine. Run pytest -rxXs to see these tests listed with a todo comment.

I think we should squash this branch to one commit, set the default engine to 'normal', and merge it. That will enable folks to use the faster numpy engine when they configure for it. Then follow up to finish resolving the remaining test failures. How do you feel about this approach?

Remaining failing tests:
Thanks!
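For reference, the pytest.mark.xfail mechanism mentioned above works roughly as follows; the engine flag and test body are hypothetical, not lasio's actual test code:

```python
import pytest

# Hypothetical flag standing in for however the suite selects an engine.
ENGINE = "numpy"

# With a condition, the test still runs (and passes) under the normal
# engine but is reported as XFAIL under the numpy engine; `pytest -rxXs`
# lists these tests along with their reasons.
@pytest.mark.xfail(ENGINE == "numpy", reason="TODO: fix for the numpy engine")
def test_example_null_policy():
    assert ENGINE == "normal"
```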
So while I was waking up I found out what you could do in the short term to fulfill the null policies:

polmask = np.array([array == x for x in NULL_POLICIES["all"]]).sum(axis=0)

Or something like that, then:

np.ma.masked_where(polmask, array)
It is not recommended to have NaN in an array. The strategy is to mask the values you don't need; NaN can create unexpected results in calculations. To mask invalid values there is a function, np.ma.masked_invalid or something like that.
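Put concretely, a runnable version of that idea, with a hypothetical NULL_VALUES list standing in for NULL_POLICIES["all"]:

```python
import numpy as np

# Hypothetical sentinel values; in lasio these come from the null policy.
NULL_VALUES = [-999.25, -9999.0]

array = np.array([1.0, -999.25, 2.5, np.nan, -9999.0])

# Build one boolean mask covering every sentinel value, then mask those
# entries rather than replacing them with NaN.
polmask = np.isin(array, NULL_VALUES)
masked = np.ma.masked_where(polmask, array)

# masked_invalid additionally masks any NaN/inf already in the data.
masked = np.ma.masked_invalid(masked)

print(masked)  # [1.0 -- 2.5 -- --]
```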
The other thing is that there will be only one NULL value per object. In a vectorial approach it makes no sense to predefine NULL; you have to test whether the masking works, but it is your object that should be tested, not a bunch of variables.
I am currently writing a parser for you. It may take a while as I am thinly stretched at the moment, but it aims at parsing all the LAS versions.
Cheers
On Sat, 24 Apr 2021, 04:29, Kent Inverarity wrote:
Yes, I think that's a good approach. I still would like to change the read
and null policy and substitutions approach to align with a future release
where the numpy engine is default, but I'll do that in a separate PR.
Thanks @dcslagel <https://github.com/dcslagel> and @Boorhin
<https://github.com/Boorhin> for doing all this! 🎉
@dcslagel I've been working on the other major data-section-reader issue in PR #461, and it will conflict with this PR as we are working in the same part of the code. Happy to have a chat either here or there about which PR we merge first (this one or that one); I'm happy to do the work of merging them.
I put one small change request in a review for #461. Other than that, #461 should go right into the master/main branch without waiting for this pull request.

More thoughts about this pull request (452): I've had second thoughts about rushing to get it into the master/main branch without all tests really passing. More generally, my concern is that adding another reader engine will increase the maintenance workload. Here is the approach I'm currently leaning toward.
Does this seem like a good set of next steps? Or are there other steps that will work?
I agree, let's do that!
This merge hangs on the following test: tests/test_enhancements.py::test_autodepthindex_point_one_inch
- Remove unneeded 'run_normal_engine'
- Remove remove_data_line_filter
- Move the curve_data_gen transform to the numpy engine
- Transpose curve_data_gen to pass test_autodepthindex
- reshape-in-data-reader (see the sketch after this list)
- Add test for not replacing NULL in index curve
- Enable writing empty LAS file
- Fix rounding issue when writing LAS file
- Add Gitter Badge
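A sketch of the reshape/transpose step these commits refer to; the section and curve count are made up, and the variable names are illustrative rather than lasio's internals:

```python
import numpy as np
from io import StringIO

section = StringIO(
    "1.0 10.0 100.0\n"
    "2.0 20.0 200.0\n"
    "3.0 30.0 300.0\n"
)

# genfromtxt yields one row per line, but lasio stores data per curve:
# reshape the values to (n_rows, n_curves), then transpose so each row
# of `curves` is one curve (the first being the depth index).
n_curves = 3
data = np.genfromtxt(section).reshape(-1, n_curves)
curves = data.T
print(curves[0])  # [1. 2. 3.] -- the index curve
```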
The current commit accomplishes steps 1-3 in #452 (comment). Here is the current set of 23 failing tests:
Commit 0f8b1bf
@kinverarity1, @Boorhin, @donald-keighley: Commit 468a0c6 passes all the tests and retains the speed for the big-file read. I wasn't able to completely replace the 'normal' parser with a 'numpy' parser for several reasons:
So the current program flow in this pull-request is:
This is some additional complexity and means maintaining both parsers, but it obtains the speed improvement for many well-formed LAS files.

Current Test Results:
Thank you,
Thanks @dcslagel for getting this to a mergeable state! 🎆
I think we should go ahead with this - there are ways to improve but at least it gives people a default boost in speed for most files.
Okay, proceeding with the merge. First, I tested the merge in my local environment. All tests pass, and the speed test is:

Thanks,
Congratulations guys!
Description:
This draft branch implements numpy.genfromtxt() as an alternate data reader, via a new function, read_data_section_iterative_numpy_engine(). It is work for issue #446 (Use an accelerated numpy or pandas reader) and builds on PR #450 (Add data section reader which uses pandas.read_csv).
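Based on the engine keyword discussed earlier in the thread, usage might look roughly like this; the keyword name and value are inferred from the discussion and may differ in the merged code:

```python
import lasio

# engine="numpy" stands for this branch's switch to the genfromtxt-based
# reader (read_data_section_iterative_numpy_engine()); "normal" selects
# the original parser. Not necessarily a released lasio API.
las = lasio.read("example.las", engine="numpy")
print(las.curves)
```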
Test Results:
Highlights:
- Benchmark comparison with the master branch: numpy-genfromtxt-explore is significantly faster.
- Benchmark comparison with the pandas-readcsv branch: pandas.read_csv is faster.
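For anyone wanting to reproduce these comparisons, a rough timing harness might look like the following; "big.las" is a placeholder path, and you would check out each branch in turn to compare:

```python
import timeit
import lasio

# Time five reads of a large file on the currently checked-out branch;
# repeat on master, pandas-readcsv, and numpy-genfromtxt-explore.
total = timeit.timeit(lambda: lasio.read("big.las"), number=5)
print(f"{total / 5:.2f} s per read")
```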