Add a numpy engine for reading using numpy.genfromtxt() #452

Merged
merged 21 commits into from
Jun 28, 2021

dcslagel
Collaborator

@dcslagel dcslagel commented Apr 18, 2021

Description:

This draft branch implements numpy.genfromtxt() as an alternate data reader. It is work toward issue #446, Use an accelerated numpy or pandas reader.

  • It builds on the work in Add data section reader which uses pandas.read_csv #450
  • It defaults to engine="numpy" for testing purposes.
  • It adds a separate function: read_data_section_iterative_numpy_engine()
  • When a file indicates that it is wrapped, it is read with engine="normal" instead.
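
As a rough illustration of what the numpy engine relies on, here is a minimal sketch of numpy.genfromtxt() reading an unwrapped, whitespace-delimited data section. The sample values and the variable names are made up for this example; this is not the body of read_data_section_iterative_numpy_engine().

```python
import io

import numpy as np

# Hypothetical unwrapped data section: one row per depth step, three curves.
data_section = """\
1670.000  123.450  2550.000
1669.875  123.450  2550.000
1669.750  -999.25  2550.000
"""

# genfromtxt accepts a file-like object; with the default delimiter (None)
# it splits each line on runs of whitespace.
array = np.genfromtxt(io.StringIO(data_section), dtype=float)

print(array.shape)
print(array[:, 0])  # first column, e.g. the depth index
```

A wrapped file spreads one depth step over several lines, so a row-per-line reader like this cannot be applied to it directly, which is why wrapped files are routed to the normal engine.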

Test Results:

Highlights:

  • The speed test is about 2.9x faster than the master branch (mean 370 ms vs. 1,090 ms).
  • Overall test coverage declines slightly, from 86% to 85%.
  • There are currently 21 test failures. These generally divide into AssertionErrors and lasio.exceptions.LASDataErrors.
---------- coverage: platform darwin, python 3.9.4-final-0 -----------
Name                       Stmts   Miss  Cover
----------------------------------------------
lasio/__init__.py             13      2    85%
lasio/convert_version.py      20     20     0%
lasio/defaults.py             11      0   100%
lasio/examples.py             42     10    76%
lasio/excel.py                88     34    61%
lasio/exceptions.py            6      0   100%
lasio/las.py                 451     65    86%
lasio/las_items.py           199     29    85%
lasio/las_version.py          50     14    72%
lasio/reader.py              446     45    90%
lasio/writer.py              171      9    95%
----------------------------------------------
TOTAL                       1497    228    85%

Benchmark comparison with Master branch: numpy-genfromtxt-explore is significantly faster
------------------------------------------------------------------------------------------------- benchmark: 2 tests -------------------------------------------------------------------------------------------------
Name (time in ms)                                  Min                   Max                  Mean            StdDev                Median               IQR            Outliers     OPS            Rounds  Iterations
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
test_read_v12_sample_big (NOW)                366.4775 (1.0)        373.0084 (1.0)        370.3707 (1.0)      2.9769 (1.68)       371.9692 (1.0)      5.1055 (1.62)          1;0  2.7000 (1.0)           5           1
test_read_v12_sample_big (0001_1c1220f)     1,087.4714 (2.97)     1,091.4720 (2.93)     1,089.7887 (2.94)     1.7711 (1.0)      1,090.2695 (2.93)     3.1479 (1.0)           1;0  0.9176 (0.34)          5           1
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

Benchmark comparison with Pandas_readcsv branch: Pandas_read_csv is faster
--------------------------------------------------------------------------------------------- benchmark: 2 tests ---------------------------------------------------------------------------------------------
Name (time in ms)                                Min                 Max                Mean            StdDev              Median               IQR            Outliers     OPS            Rounds  Iterations
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
test_read_v12_sample_big (0001_bfa11ec)     176.7494 (1.0)      184.5382 (1.0)      180.4695 (1.0)      2.8277 (1.0)      180.5998 (1.0)      3.2284 (1.0)           2;0  5.5411 (1.0)           5           1
test_read_v12_sample_big (NOW)              358.8873 (2.03)     367.7130 (1.99)     361.8933 (2.01)     3.7042 (1.31)     360.7237 (2.00)     5.4340 (1.68)          1;0  2.7632 (0.50)          5           1
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

kinverarity1 and others added 9 commits April 17, 2021 21:32
This commit adds the normal data section reader back into reader.py
and also add some keyword arguments to LASFile.read. The behaviour
is unchanged if engine == "normal" (default).

If engine == "pandas", then:

    If the file is wrapped:

        if pandas_engine_wrapped_error is True (default): a
            LASDataError exception is raised.
        if False, a logger.warning message is emitted.

    The data section is then read using reader.py:read_data_section_iterative_pandas_engine()

    If an exception is raised in that function:

        If pandas_engine_error == "retry" (default):
            the data section will be re-read by the normal parser.

        Otherwise if it is "error":
            the exception will be raised.

One problem is that pd.read_csv doesn't always raise an exception as we'd perhaps like it to.
This checkin is a quick hack to get an initial view of using
numpy.genfromtxt() for importing data sections.
This checkin is based on the pandas-readcsv branch content and makes the
following changes:
- Set 'pandas' as the default engine. This is so we can run all the
  current tests with 'pandas' (actually numpy.genfromtxt()) and get an
  initial view of any test failures.
- Replace the actual 'pandas.read_csv(...)' call with numpy.genfromtxt()
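
The control flow described in the commit message above can be sketched roughly as follows. This is only an illustration: the function name read_data, the stand-in LASDataError class, and the injected fast_parser / normal_parser callables are hypothetical, and only the keyword names pandas_engine_wrapped_error and pandas_engine_error come from the commit message.

```python
import logging

logger = logging.getLogger(__name__)


class LASDataError(Exception):
    """Stand-in for lasio.exceptions.LASDataError (illustration only)."""


def read_data(lines, wrapped, fast_parser, normal_parser, engine="normal",
              pandas_engine_wrapped_error=True, pandas_engine_error="retry"):
    """Hypothetical dispatcher mirroring the flow in the commit message."""
    if engine == "normal":
        return normal_parser(lines)

    if wrapped:
        if pandas_engine_wrapped_error:
            raise LASDataError("wrapped files are not supported by this engine")
        logger.warning("file is wrapped; the fast engine may misread it")

    try:
        return fast_parser(lines)
    except Exception:
        if pandas_engine_error == "retry":
            # Re-read the data section with the normal parser.
            return normal_parser(lines)
        raise
```

Note that the retry only helps when the fast parser actually raises; as the commit message points out, a parser that silently returns wrong values slips past this check.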
@kinverarity1
Owner

Reverting to the normal engine for wrapped files brings it down to 18 test failures.

---------------------------------------------------- benchmark: 1 tests ---------------------------------------------------
Name (time in ms)                 Min       Max      Mean   StdDev    Median      IQR  Outliers     OPS  Rounds  Iterations
---------------------------------------------------------------------------------------------------------------------------
test_read_v12_sample_big     436.3694  468.9999  450.1731  13.2622  448.1868  21.1479       2;0  2.2214       5           1
---------------------------------------------------------------------------------------------------------------------------

Legend:
  Outliers: 1 Standard Deviation from Mean; 1.5 IQR (InterQuartile Range) from 1st Quartile and 3rd Quartile.
  OPS: Operations Per Second, computed as 1 / Mean
=========================== short test summary info ============================
FAILED tests/test_null_policy.py::test_null_policy_9999_aggressive - Assertio...
FAILED tests/test_null_policy.py::test_null_policy_9999_all - AssertionError:...
FAILED tests/test_null_policy.py::test_null_policy_custom_1_caught_9998 - Ass...
FAILED tests/test_null_policy.py::test_null_policy_custom_2 - AssertionError:...
FAILED tests/test_null_policy.py::test_null_policy_ERR_strict - AssertionErro...
FAILED tests/test_null_policy.py::test_null_policy_runon_replaced_1 - lasio.e...
FAILED tests/test_null_policy.py::test_null_policy_runon_replaced_2 - lasio.e...
FAILED tests/test_null_policy.py::test_null_policy_runon_ok_1 - lasio.excepti...
FAILED tests/test_null_policy.py::test_null_policy_runon_ok_2 - lasio.excepti...
FAILED tests/test_null_policy.py::test_null_policy_small_non_zero_neg_nums - ...
FAILED tests/test_read.py::test_comma_decimal_mark_data - assert nan == 123.42
FAILED tests/test_read.py::test_missing_a_section - lasio.exceptions.LASDataE...
FAILED tests/test_read.py::test_blank_line_in_header - lasio.exceptions.LASDa...
FAILED tests/test_read.py::test_data_characters_1 - AssertionError: assert na...
FAILED tests/test_read.py::test_data_characters_2 - AssertionError: assert na...
FAILED tests/test_read.py::test_data_characters_types - AssertionError: asser...
FAILED tests/test_read.py::test_read_incorrect_shape - lasio.exceptions.LASDa...
FAILED tests/test_read.py::test_quoted_substrings_in_data_section - lasio.exc...
============ 18 failed, 217 passed, 2 skipped, 1 warning in 10.15s =============

@dcslagel dcslagel changed the title Experimental exploration with numpy.genfromtxt Add a numpy engine for reading using numpy.genfromtxt() Apr 22, 2021
@dcslagel dcslagel added the data-section-parser A bug or enhancement relating to the data section parser label Apr 22, 2021
- If numpy-engine throws an exception on data-read then retry with the
  normal engine.
- Remove '_iterative' from the names of the data-read engine functions.
@dcslagel
Collaborator Author

Remaining test failures:

FAILED tests/test_null_policy.py::test_null_policy_9999_aggressive - AssertionError: assert False
FAILED tests/test_null_policy.py::test_null_policy_9999_all - AssertionError: assert False
FAILED tests/test_null_policy.py::test_null_policy_custom_1_caught_9998 - AssertionError: assert False
FAILED tests/test_null_policy.py::test_null_policy_custom_2 - AssertionError: assert False
FAILED tests/test_null_policy.py::test_null_policy_ERR_strict - AssertionError: assert nan == 'ERR'
FAILED tests/test_null_policy.py::test_null_policy_small_non_zero_neg_nums - AssertionError: assert False
FAILED tests/test_read.py::test_comma_decimal_mark_data - assert nan == 123.42
FAILED tests/test_read.py::test_data_characters_1 - AssertionError: assert nan == '00:00:00'
FAILED tests/test_read.py::test_data_characters_2 - AssertionError: assert nan == '01-Jan-20'
FAILED tests/test_read.py::test_data_characters_types - AssertionError: assert False

Make these temporary changes to enable integrating the numpy-engine.
- Route "aggressive" and "all" null_policies to the normal-engine.
- Set tests that fail for numpy-engine to XFAIL. These tests will
  continue to pass for the normal-engine.
- First draft of using genfromtxt's usemask and missing_values to align
  functionality with the normal-engine. This needs follow-up work.
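
For readers unfamiliar with these genfromtxt options, here is a minimal sketch of how missing_values and usemask might be combined to turn a LAS NULL value into nan. It is an assumption-laden illustration with made-up sample data, not the code in this commit. As far as I understand genfromtxt, the mask is built by comparing each raw field string against missing_values, so the NULL text has to match exactly as written in the file.

```python
import io

import numpy as np

NULL = "-999.25"  # hypothetical NULL value, as written in the data section
text = "1670.000 -999.25 2550.0\n1669.875 123.45 -999.25\n"

# usemask=True returns a masked array; fields whose text equals NULL are masked.
masked = np.genfromtxt(io.StringIO(text), missing_values=NULL, usemask=True)

# Replace the masked entries with nan to get a plain float array.
data = masked.filled(np.nan)

print(masked.mask)
print(data)
```

A value such as -999.2500 would not be caught by this string comparison, which hints at why the more permissive null policies were routed to the normal engine.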
@dcslagel dcslagel force-pushed the numpy-genfromtxt-explore branch from df91e29 to 528bd81 Compare April 23, 2021 23:08
@dcslagel
Collaborator Author

dcslagel commented Apr 23, 2021

@Boorhin and I tried to resolve the remaining test failures for the numpy-engine, but they are going to take more time. So the latest checkin on dcslagel:numpy-genfromtxt-explore consists of temporary workarounds (routing some null_policy keys to normal-engine) and marks the failing tests with pytest.mark.xfail(). Xfail enables these tests to run and pass for normal-engine but be ignored when tested with numpy-engine. Run pytest -rxXs to see these tests listed with a todo comment.
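
A conditional xfail marker of the kind described above could look roughly like this. The ENGINE switch and the toy read_value() helper are hypothetical and exist only to make the sketch self-contained; the real tests call lasio itself.

```python
import pytest

# Hypothetical module-level switch naming the engine under test.
ENGINE = "numpy"


def read_value(text, engine):
    """Toy reader: the "numpy" path turns non-numeric text into nan."""
    try:
        return float(text)
    except ValueError:
        if engine == "numpy":
            return float("nan")
        return text


# TODO: remove the marker once the numpy engine keeps string values.
@pytest.mark.xfail(ENGINE == "numpy", reason="numpy engine returns nan for text")
def test_data_characters():
    assert read_value("00:00:00", ENGINE) == "00:00:00"
```

With the condition false (normal engine) the test runs and must pass as usual; with it true, a failure is reported as xfailed rather than as an error.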

I think we should squash this branch to one commit, set the default engine to 'normal', and merge it. That will enable folks to use the faster numpy engine when they configure for it. We can then follow up by finishing the work of resolving the remaining test failures. How do you feel about this approach?

Remaining failing tests:

FAILED tests/test_null_policy.py::test_null_policy_ERR_strict - AssertionError: assert nan == 'ERR'
FAILED tests/test_read.py::test_comma_decimal_mark_data - assert nan == 123.42
FAILED tests/test_read.py::test_data_characters_1 - AssertionError: assert nan == '00:00:00'
FAILED tests/test_read.py::test_data_characters_2 - AssertionError: assert nan == '01-Jan-20'
FAILED tests/test_read.py::test_data_characters_types - AssertionError: assert False

Thanks!,
DC

@kinverarity1
Owner

Yes, I think that's a good approach. I still would like to change the read and null policy and substitutions approach to align with a future release where the numpy engine is default, but I'll do that in a separate PR. Thanks @dcslagel and @Boorhin for doing all this! 🎉

@kinverarity1 kinverarity1 mentioned this pull request Apr 24, 2021
@Boorhin

Boorhin commented Apr 24, 2021 via email

@kinverarity1
Owner

@dcslagel I've been working on the other major data-section-reader issue in PR #461, and it will conflict with this PR as we are working in the same part of the code. Happy to have a chat either here or there about which PR we merge first (this one or that one). I'm happy to do the work of merging them.

@dcslagel
Collaborator Author

dcslagel commented Apr 25, 2021

@kinverarity1,

I put one small change request in a review for #461. Other than that, #461 should go right into the master/main branch without waiting for this pull-request.

More thoughts about this pull-request (452):

I've had 2nd thoughts about rushing to get it into master/main branch without all tests really passing. More generally, my concern is that adding another reader-engine will increase the maintenance workload.

Here is the approach I'm currently leaning toward.

  1. Revert
    528bd81 Numpy-engine temp workarounds for failing tests
    ( and a few other commits back to just after d3cea21 Remove pandas reader code)

  2. Merge Allow different data types per curve in data section reader #461 to master/main

  3. Merge master/main to this pull-request: 452

  4. Continue to work through the failing tests on 452 (syncing with master/main as needed). Once they are all passing, assess whether we can completely replace the normal-reader engine with the numpy-reader engine. If we run into tests that we cannot configure numpy-reader to pass, continue to look at other options for improving read performance while maintaining Lasio capability and stability.

Does this seem like a good set of next steps? Or are there other steps that will work?

@kinverarity1
Owner

I agree, let's do that!

dcslagel added 5 commits May 1, 2021 10:02
This merge hangs on the following test.
tests/test_enhancements.py::test_autodepthindex_point_one_inch
- Remove unneeded 'run_normal_engine'
- remove remove_data_line_filter
- Move curve_data_gen Transform to numpy-eng
- Transpose curve_data_gen to pass test_autodepthindex
- reshape-in-data-reader
- Add test for not replacing NULL in index curve
- Enable writing empty LAS file
- Fix rounding issue when writing LAS file
- Add Gitter Badge
@dcslagel dcslagel mentioned this pull request May 2, 2021
@dcslagel
Collaborator Author

dcslagel commented May 2, 2021

The current commit accomplishes steps 1-3 in #452 (comment)

Here is the current set of 23 failing tests:
These generally divide into AssertionErrors and lasio.exceptions.LASDataErrors.

======================================================= short test summary info ========================================================
FAILED tests/test_null_policy.py::test_null_policy_9999_aggressive - AssertionError: assert False
FAILED tests/test_null_policy.py::test_null_policy_9999_all - AssertionError: assert False
FAILED tests/test_null_policy.py::test_null_policy_custom_1_caught_9998 - AssertionError: assert False
FAILED tests/test_null_policy.py::test_null_policy_custom_2 - AssertionError: assert False
FAILED tests/test_null_policy.py::test_null_policy_ERR_strict - AssertionError: assert nan == 'ERR'
FAILED tests/test_null_policy.py::test_null_policy_runon_replaced_1 - lasio.exceptions.LASDataError: 
FAILED tests/test_null_policy.py::test_null_policy_runon_replaced_2 - lasio.exceptions.LASDataError: 
FAILED tests/test_null_policy.py::test_null_policy_runon_ok_1 - lasio.exceptions.LASDataError: 
FAILED tests/test_null_policy.py::test_null_policy_runon_ok_2 - lasio.exceptions.LASDataError: 
FAILED tests/test_null_policy.py::test_null_policy_small_non_zero_neg_nums - AssertionError: assert False
FAILED tests/test_read.py::test_comma_decimal_mark_data - assert nan == 123.42
FAILED tests/test_read.py::test_missing_a_section - lasio.exceptions.LASDataError: 
FAILED tests/test_read.py::test_blank_line_in_header - lasio.exceptions.LASDataError: 
FAILED tests/test_read.py::test_issue92 - TypeError: 'numpy.float64' object does not support item assignment
FAILED tests/test_read.py::test_data_characters_1 - AssertionError: assert nan == '00:00:00'
FAILED tests/test_read.py::test_data_characters_2 - AssertionError: assert nan == '01-Jan-20'
FAILED tests/test_read.py::test_data_characters_types - AssertionError: assert False
FAILED tests/test_read.py::test_read_incorrect_shape - lasio.exceptions.LASDataError: 
FAILED tests/test_read.py::test_quoted_substrings_in_data_section - lasio.exceptions.LASDataError: 
FAILED tests/test_read.py::test_sample_dtypes_specified - assert False
FAILED tests/test_read.py::test_sample_dtypes_specified_as_dict - assert False
FAILED tests/test_read.py::test_sample_dtypes_specified_as_false - assert False
FAILED tests/test_write.py::test_write_single_step - TypeError: 'numpy.float64' object does not support item assignment
====================================== 23 failed, 218 passed, 2 skipped, 8648 warnings in 10.36s =======================================

@dcslagel
Collaborator Author

Commit 0f8b1bf

======================================================== short test summary info =========================================================
FAILED tests/test_null_policy.py::test_null_policy_9999_aggressive - AssertionError: assert False
FAILED tests/test_null_policy.py::test_null_policy_9999_all - AssertionError: assert False
FAILED tests/test_null_policy.py::test_null_policy_custom_1_caught_9998 - AssertionError: assert False
FAILED tests/test_null_policy.py::test_null_policy_custom_2 - AssertionError: assert False
FAILED tests/test_null_policy.py::test_null_policy_ERR_strict - AssertionError: assert nan == 'ERR'
FAILED tests/test_null_policy.py::test_null_policy_runon_replaced_1 - lasio.exceptions.LASDataError: Traceback (most recent call last):
FAILED tests/test_null_policy.py::test_null_policy_runon_replaced_2 - lasio.exceptions.LASDataError: Traceback (most recent call last):
FAILED tests/test_null_policy.py::test_null_policy_runon_ok_1 - lasio.exceptions.LASDataError: Traceback (most recent call last):
FAILED tests/test_null_policy.py::test_null_policy_runon_ok_2 - lasio.exceptions.LASDataError: Traceback (most recent call last):
FAILED tests/test_null_policy.py::test_null_policy_small_non_zero_neg_nums - AssertionError: assert False
FAILED tests/test_read.py::test_comma_decimal_mark_data - assert nan == 123.42
FAILED tests/test_read.py::test_missing_a_section - lasio.exceptions.LASDataError: Traceback (most recent call last):
FAILED tests/test_read.py::test_blank_line_in_header - lasio.exceptions.LASDataError: Traceback (most recent call last):
FAILED tests/test_read.py::test_data_characters_1 - AssertionError: assert nan == '00:00:00'
FAILED tests/test_read.py::test_data_characters_2 - AssertionError: assert nan == '01-Jan-20'
FAILED tests/test_read.py::test_data_characters_types - AssertionError: assert False
FAILED tests/test_read.py::test_read_incorrect_shape - lasio.exceptions.LASDataError: Traceback (most recent call last):
FAILED tests/test_read.py::test_quoted_substrings_in_data_section - lasio.exceptions.LASDataError: Traceback (most recent call last):
FAILED tests/test_read.py::test_sample_dtypes_specified - assert False
FAILED tests/test_read.py::test_sample_dtypes_specified_as_dict - assert False
FAILED tests/test_read.py::test_sample_dtypes_specified_as_false - assert False
=============================================== 21 failed, 220 passed, 2 skipped in 10.42s ===============================================

@dcslagel
Collaborator Author

dcslagel commented Jun 11, 2021

@kinverarity1 , @Boorhin, @donald-keighley,

Commit 468a0c6 passes all the tests and retains the speed for the big file read. I wasn't able to completely replace the 'normal' parser with a 'numpy' parser for several reasons:

  1. Only the normal parser handles wrapped files.
  2. When there is a custom null_policy, I haven't been able to work out how to get numpy to use the custom policy. It might be doable, but I haven't figured it out so far...
  3. When custom dtypes (non-'auto') are specified, numpy's performance degrades to the level of the normal parser.

So the current program flow in this pull-request is:

  • Set the default parser to be numpy-parser
  • Check for conditions 1-3 above and, if any of them is true, fall back to the normal-parser
  • If the parser is still the numpy-parser, attempt to run with it. If an exception is thrown, retry with the normal parser.

This adds some complexity and means maintaining both parsers, but it obtains the speed improvement for many well-formed LAS files.
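
The engine selection described above can be sketched as a small pre-check. The names choose_engine and read_with_fallback are hypothetical, and treating any null_policy other than "strict" as custom is an assumption made for this sketch; the actual checks in reader.py may differ.

```python
def choose_engine(wrapped, null_policy, dtypes, default_null_policy="strict"):
    """Return "numpy" unless one of the three fallback conditions holds."""
    if wrapped:                              # 1. wrapped files
        return "normal"
    if null_policy != default_null_policy:   # 2. custom null_policy (assumed test)
        return "normal"
    if dtypes != "auto":                     # 3. custom dtypes
        return "normal"
    return "numpy"


def read_with_fallback(lines, engine, numpy_parser, normal_parser):
    """Try the numpy parser when selected; retry with the normal parser on error."""
    if engine == "numpy":
        try:
            return numpy_parser(lines)
        except Exception:
            pass  # fall through to the normal parser
    return normal_parser(lines)
```

Keeping the checks in one place makes the fallback conditions easy to relax later, for example if custom null policies become expressible in numpy.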

Current Test Results:

Name                       Stmts   Miss  Cover
----------------------------------------------
lasio/__init__.py             13      2    85%
lasio/convert_version.py      20     20     0%
lasio/defaults.py             11      0   100%
lasio/examples.py             42     10    76%
lasio/excel.py                88     34    61%
lasio/exceptions.py            6      0   100%
lasio/las.py                 457     65    86%
lasio/las_items.py           199     29    85%
lasio/las_version.py          50     14    72%
lasio/reader.py              446     28    94%
lasio/writer.py              171      9    95%
----------------------------------------------
TOTAL                       1503    211    86%
Coverage XML written to file coverage.xml


--------------------------------------------------- benchmark: 1 tests ------------------------------
Name (time in ms)                 Min       Max      Mean  StdDev    Median     IQR  Outliers     OPS
-----------------------------------------------------------------------------------------------------
test_read_v12_sample_big     335.3062  353.3825  341.6552  6.8935  339.8994  6.2280       1;0  2.9269
-----------------------------------------------------------------------------------------------------

--
Let me know if this change could be accepted (or rejected) or
needs some additional changes to be approved and merged.

Thank you,
DC

@dcslagel dcslagel marked this pull request as ready for review June 11, 2021 21:33
@dcslagel dcslagel requested a review from kinverarity1 June 11, 2021 21:33
Owner

@kinverarity1 kinverarity1 left a comment

Thanks @dcslagel for getting this to mergeable state! 🎆

I think we should go ahead with this - there are ways to improve but at least it gives people a default boost in speed for most files.

@dcslagel
Collaborator Author

Okay, proceeding with the merge. First, I tested the merge in my local environment. All tests pass and the speed test is: test_read_v12_sample_big 332.0824. I am finalizing the merge via the GitHub interface so that it will be signed with GitHub's verification signature.

Thanks,
DC

@dcslagel dcslagel merged commit f62686e into kinverarity1:master Jun 28, 2021
@Boorhin

Boorhin commented Jun 29, 2021

Congratulations guys!
I would have liked to be more available, but I have a big project for now. However, I learned a few things in numpy that may help for LASIO.

@dcslagel dcslagel deleted the numpy-genfromtxt-explore branch July 20, 2022 17:33
Labels
data-section-parser A bug or enhancement relating to the data section parser
3 participants