small changes to get ready for pdf #261

blublinsky · 2024-06-11T12:12:33Z

Why are these changes needed?

There are currently 2 main issues for implementing pdf conversion:

PDF conversion is based on two files, *.csv and *.pdf, so they need to be processed as a pair. Both files have the same name but different extensions. The solution is to read only the *.csv file and then derive the PDF file name by swapping the extension. Something like this:

pdf_name = TransformUtils.get_file_extension(file_name)[0] + ".pdf"

and then read the PDF file.
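A minimal sketch of the pairing step, assuming `TransformUtils.get_file_extension` behaves like `os.path.splitext` (returning the name without extension first, then the extension). The helper name `paired_pdf_name` is hypothetical and only used for illustration:

```python
import os


def paired_pdf_name(csv_name: str) -> str:
    """Derive the name of the PDF that accompanies a given CSV file.

    Stand-in for TransformUtils.get_file_extension(file_name)[0] + ".pdf";
    os.path.splitext is used here as an assumed equivalent.
    """
    base, _ext = os.path.splitext(csv_name)
    return base + ".pdf"


# The CSV file drives the processing; the PDF name is derived from it.
print(paired_pdf_name("input/doc1.csv"))
```

With this approach only the *.csv files need to be listed as inputs; the matching PDF is looked up by name when each CSV is processed.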

Because PDF conversion goes from *.csv to *.parquet, checkpointing is broken. This PR fixes checkpointing by:
a. introducing a new parameter, files_to_checkpoint, which defaults to [.parquet]
b. comparing only the full names, without extensions, for the purposes of checkpointing
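The checkpointing idea in (a) and (b) can be sketched roughly as follows. This is an illustration under assumptions, not the code from this PR: the function name `files_still_to_process` is hypothetical, and input and output paths are assumed to be relative to their respective root folders so that names can be compared directly.

```python
import os


def files_still_to_process(inputs, outputs, files_to_checkpoint=(".parquet",)):
    """Return the inputs that have no matching output yet.

    An output counts as a checkpoint only if its extension is listed in
    files_to_checkpoint. Matching is done on the name without extension,
    so doc1.csv is considered done once doc1.parquet exists.
    """
    done = set()
    for out in outputs:
        base, ext = os.path.splitext(out)
        if ext in files_to_checkpoint:
            done.add(base)
    return [f for f in inputs if os.path.splitext(f)[0] not in done]


remaining = files_still_to_process(
    inputs=["doc1.csv", "doc2.csv"],
    outputs=["doc1.parquet", "doc2.tmp"],
)
print(remaining)
```

Comparing names without extensions is what allows a transform to change the file type between input and output without losing track of what has already been processed.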

dolfim-ibm

data-processing-lib/python/src/data_processing/transform/binary_transform.py

small changes to get ready for pdf

ef39bb3

blublinsky requested a review from daw3rd June 11, 2024 12:13

dolfim-ibm approved these changes Jun 11, 2024

daw3rd requested changes Jun 11, 2024

data-processing-lib/python/src/data_processing/transform/binary_transform.py

blublinsky requested a review from daw3rd June 11, 2024 13:46

added optional parameter to transform

6b91a9a

daw3rd approved these changes Jun 11, 2024

daw3rd merged commit d468da0 into dev Jun 11, 2024
15 checks passed

This was referenced Jun 12, 2024

fix lang_id's test-src #267

Closed

add build-language job to build-images workflow #268

Merged