Skip to content

Commit

Permalink
WebDataset Reader (#230)
Browse files Browse the repository at this point in the history
* First NWC for webDatasets

* Add some funcs - NWC

* NCW

* cmake for tar file

* Fix Cmake issue

* Add WebDataset source readers

* Fix issues with soure readers

* NWC

* Parsing correct labels by adding the tar & file_stream utilities

* Changes to call the WebDataset source reader

* WebDataset source reader changes to include tar utils and stream utils

* Fix look up issue

* Fix the outputs when testing train.tar

* Fix image outputs for webdataset reader

* Add the support for index file parsing

* Add MetaData support for WebDataset reader to support storing ASCII values.

* Fixing the 2nd component outputs in WebDataset Reader

* Level 1 formatting

* Rename utilities functions

* Code clean up

* Add webdataset source evaluator

* Fix warnings and unused variables

* Name changes in webdataset source reader

* Add Missing component Behaviour

* Fix issues with webdataset source reader

* Minor changes

* Commit changes for webdataset reader.py file

* Add missing component behaviouour

* Add MissingComponentsBehaviour enum and python changes

* Fix missing components behavior empty and skip

* Multi-GPU support for webdataset reader

* Add reset loaders - auto_reset()

* Calling resize in crop resize

* Minor change for shuffle in webdataset reader

* Fix issue with WebDataset Reader for index files usage

* Resolve PR comments

* PR 11 comments

* Adding QA tests for webdataset reader

* Fix QA tests of cpp

* Minor change

* revert the crop resize commit - requires the PR for fused crop resize

* Minor change in webdataset example file

* Del webdataset example file

* Formatting changes

* Fix a minor indendation issue

* Minor changes

* Make the stick_to_shard True

* Fix the issue with webdataset PARTIAL policy

* Update .gitignore

* Update rocal_api_meta_data.h

* Update rocal_api_meta_data.cpp

* Update node_crop_resize.cpp

* Update node_crop_resize.cpp - no changes in file

* Update unit_tests.sh - remove extra comments

* LBP changes 1

* Working commit - LBP

* Working commit - LBP change 2

* crop resize addition

* Resolve PR comments - 1

* Temp NWC

* Fix the NWC

* Convert the vector to shared_ptr<vector> for AsciiComponent

* Update meta_data.h

* Update .gitignore

* PRR comments fetched

* Pr comments resolution

* Resolving multiple pr comments

* add eof

* Update unit tests.sh

* Move the tar_helper_functions to helpers folder

* Add tar helper files

* Minor change - CmakeLists

* Fix throw for missing component behaviour

* Resolve PR comments

* Handle the MISSING COMPOENENT BEHAVIOUR of EMPTY correctly

* Fix the build error

* Minor Fix

* Add setup instructions for libtar in rocAL-setup.py

* Resolving review comment

* Removed comment

* Resolving review comment and changing version number

* Fix the Ascii metadata wrt rocalListOfTensorList

* Fix unit test for webdataset reader

* Resolving review comments and unit test changes

* Fix ascii metadata tensor list vector

* Fix iterator for partial policy

* Minor changes

* Adding CMake flag for wds reader

* Minor changes

* Adding reset_mem_handle API

---------

Co-authored-by: fiona-gladwin <[email protected]>
Co-authored-by: Kiriti Gowda <[email protected]>
Co-authored-by: Sundar Rajan Vaithiyanathan <[email protected]>
Co-authored-by: SundarRajan28 <[email protected]>
  • Loading branch information
5 people authored Jan 14, 2025
1 parent 23103d5 commit a846299
Show file tree
Hide file tree
Showing 37 changed files with 2,072 additions and 42 deletions.
63 changes: 63 additions & 0 deletions cmake/FindLibTar.cmake
Original file line number Diff line number Diff line change
@@ -0,0 +1,63 @@
################################################################################
#
# MIT License
#
# Copyright (c) 2024 Advanced Micro Devices, Inc.
#
# Permission is hereby granted, free of charge, to any person obtaining a copy
# of this software and associated documentation files (the "Software"), to deal
# in the Software without restriction, including without limitation the rights
# to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
# copies of the Software, and to permit persons to whom the Software is
# furnished to do so, subject to the following conditions:
#
# The above copyright notice and this permission notice shall be included in all
# copies or substantial portions of the Software.
#
# THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
# IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
# AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
# LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
# OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
# SOFTWARE.
#
################################################################################
find_path(LIBTAR_INCLUDE_DIRS
NAMES libtar.h
HINTS
$ENV{LIBTAR_PATH}/include
PATHS
/usr/include
/usr/local/include
)
mark_as_advanced(LIBTAR_INCLUDE_DIRS)

find_library(LIBTAR_LIBRARIES
NAMES libtar.a tar libtar
HINTS
$ENV{LIBTAR_PATH}/lib
$ENV{LIBTAR_PATH}/lib64
PATHS ${CMAKE_SYSTEM_PREFIX_PATH} ${LIBTAR_PATH} "/usr/local" "/usr/lib"
PATH_SUFFIXES lib lib64)

mark_as_advanced(LIBTAR_LIBRARIES)

if(LIBTAR_LIBRARIES AND LIBTAR_INCLUDE_DIRS)
message("-- ${White}Using Libtar -- \n\tLibraries:${LIBTAR_LIBRARIES} \n\tIncludes:${LIBTAR_INCLUDE_DIRS}${ColourReset}")
set(LIBTAR_FOUND TRUE)
else()
message( "-- ${Yellow}NOTE: FindLibTar failed to find -- LibTar${ColourReset}" )
endif()

include(FindPackageHandleStandardArgs)
find_package_handle_standard_args(LibTar
FOUND_VAR LIBTAR_FOUND
REQUIRED_VARS
LIBTAR_LIBRARIES
LIBTAR_INCLUDE_DIRS
)

set(LIBTAR_FOUND ${LIBTAR_FOUND} CACHE INTERNAL "")
set(LIBTAR_LIBRARIES ${LIBTAR_LIBRARIES} CACHE INTERNAL "")
set(LIBTAR_INCLUDE_DIRS ${LIBTAR_INCLUDE_DIRS} CACHE INTERNAL "")
6 changes: 6 additions & 0 deletions rocAL-setup.py
Original file line number Diff line number Diff line change
Expand Up @@ -389,6 +389,12 @@ def ERROR_CHECK(waitval):
os.system('(cd '+deps_dir+'; git clone https://github.com/Tencent/rapidjson.git; cd rapidjson; mkdir build; cd build; ' +
linuxCMake+' ../; make -j$(nproc); sudo make install)')

# libtar - https://github.com/tklauser/libtar.git
ERROR_CHECK(os.system(
'(cd '+deps_dir+'; git clone https://github.com/tklauser/libtar.git )'))
ERROR_CHECK(os.system('(cd '+deps_dir+'/libtar; '+
' autoreconf --force --install; CFLAGS="-fPIC" ./configure; make -j$(nproc); sudo make install )'))

# Optional Deps
if "Ubuntu" in platformInfo:
for i in range(len(debianOptionalPackages)):
Expand Down
10 changes: 10 additions & 0 deletions rocAL/CMakeLists.txt
Original file line number Diff line number Diff line change
Expand Up @@ -44,6 +44,7 @@ find_package(RapidJSON QUIET)
find_package(StdFilesystem QUIET)
find_package(HALF QUIET)
find_package(SndFile QUIET)
find_package(LibTar QUIET)

# HIP Backend
if(GPU_SUPPORT AND "${BACKEND}" STREQUAL "HIP")
Expand Down Expand Up @@ -343,6 +344,15 @@ if(${BUILD_ROCAL})
else()
message("-- ${Yellow}NOTE: rocAL built without Audio support - Audio Functionalities will not be enabled${ColourReset}")
endif()
# LibTar
if(LIBTAR_FOUND)
include_directories(${LIBTAR_INCLUDE_DIRS})
set(LINK_LIBRARY_LIST ${LINK_LIBRARY_LIST} ${LIBTAR_LIBRARIES})
message("-- ${White}NOTE: rocAL built with LibTar support - WebDataset reader enabled${ColourReset}")
target_compile_definitions(${PROJECT_NAME} PUBLIC -DENABLE_WDS)
else()
message("-- ${Yellow}NOTE: rocAL built without LibTar - WebDataset reader will not be supported${ColourReset}")
endif()
# -Wall -- Enable most warning messages
# -mavx2 -- Support MMX, SSE, SSE2, SSE3, SSSE3, SSE4.1, SSE4.2, AVX and AVX2 built-in functions and code generation
# -mfma -- Support MMX, SSE, SSE2, SSE3, SSSE3, SSE4.1, SSE4.2, AVX and FMA built-in functions and code generation
Expand Down
32 changes: 32 additions & 0 deletions rocAL/include/api/rocal_api_data_loaders.h
Original file line number Diff line number Diff line change
Expand Up @@ -940,4 +940,36 @@ extern "C" RocalTensor ROCAL_API_CALL rocalAudioFileSourceSingleShard(RocalConte
unsigned max_decoded_channels = 0,
RocalShardingInfo rocal_sharding_info = RocalShardingInfo());

/*! Creates WebDataset tar files reader and decoder. It allocates the resources and objects required to read and decode files in webdataset format stored on the file systems. It has internal sharding capability to load/decode in parallel is user wants.
* \param [in] context Rocal context
* \param [in] source_path A NULL terminated char string pointing to the location of files on the disk
* \param [in] index_path A NULL terminated char string pointing to the location of index files on the disk
* \param [in] rocal_color_format The color format the images will be decoded to.
* \param [in] shard_id Shard id for this loader.
* \param [in] shard_count Defines the parallelism level by internally sharding the input dataset and load/decode using multiple decoder/loader instances. Using shard counts bigger than 1 improves the load/decode performance if compute resources (CPU cores) are available.
* \param [in] is_output Boolean variable to enable the audio to be part of the output.
* \param [in] shuffle Boolean variable to shuffle the dataset.
* \param [in] loop Boolean variable to indefinitely loop through audio.
* \param [in] decode_size_policy is the RocalImageSizeEvaluationPolicy for decoding
* \param [in] max_width The maximum width of the decoded image files, larger or smaller will be resized to closest
* \param [in] max_height The maximum height of the decoded image files, larger or smaller will be resized to closest
* \param [in] rocal_decoder_type Determines the decoder_type - image / video / audio
* \param [in] rocal_sharding_info The members of RocalShardingInfo determines how the data is distributed among the shards and how the last batch is processed by the pipeline.
* \return Reference to the output tensor
*/
extern "C" RocalTensor ROCAL_API_CALL rocalWebDatasetSourceSingleShard(RocalContext p_context,
const char* source_path,
const char* index_path,
RocalImageColor rocal_color_format,
unsigned shard_id,
unsigned shard_count,
bool is_output,
bool shuffle = false,
bool loop = false,
RocalImageSizeEvaluationPolicy decode_size_policy = ROCAL_USE_MAX_SIZE,
unsigned max_width = 0,
unsigned max_height = 0,
RocalDecoderType dec_type = RocalDecoderType::ROCAL_DECODER_TJPEG,
RocalShardingInfo rocal_sharding_info = RocalShardingInfo());

#endif // MIVISIONX_ROCAL_API_DATA_LOADERS_H
21 changes: 21 additions & 0 deletions rocAL/include/api/rocal_api_meta_data.h
Original file line number Diff line number Diff line change
Expand Up @@ -23,6 +23,7 @@ THE SOFTWARE.
#ifndef MIVISIONX_ROCAL_API_META_DATA_H
#define MIVISIONX_ROCAL_API_META_DATA_H
#include "rocal_api_types.h"
#include <set>

/*!
* \file
Expand Down Expand Up @@ -316,4 +317,24 @@ extern "C" void ROCAL_API_CALL rocalBoxIouMatcher(RocalContext p_context, std::v
*/
extern "C" RocalTensorList ROCAL_API_CALL rocalGetMatchedIndices(RocalContext p_context);

/*! \brief creates webdataset reader
* \ingroup group_rocal_meta_data
* \param [in] p_context rocal context
* \param [in] source_path path to the folder that contains the dataset
* \param [in] index_path path to the folder that contains the index files
* \param extensions [in] the extensions used in the tar files for parsing them
* \param missing_components_behavior [in] The behaviour that determines what happens when any component in the sample is missing.
* \param is_output [in] The output is set or not.
* \return RocalMetaData object, can be used to inquire about the rocal's output (processed) tensors
*/
extern "C" RocalMetaData ROCAL_API_CALL rocalCreateWebDatasetReader(RocalContext p_context, const char* source_path, const char* index_path,
std::vector<std::set<std::string>> extensions, RocalMissingComponentsBehaviour missing_components_behavior, bool is_output);

/*! \brief get joints data pointer
* \ingroup group_rocal_meta_data
* \param [in] rocal_context rocal context
* \param [out] ascii_data The user's AsciiDatas pointer that will be pointed to AsciiDataBatch pointer
*/
RocalMetaData ROCAL_API_CALL rocalGetAsciiDatas(RocalContext p_context);

#endif // MIVISIONX_ROCAL_API_META_DATA_H
15 changes: 15 additions & 0 deletions rocAL/include/api/rocal_api_types.h
Original file line number Diff line number Diff line change
Expand Up @@ -483,4 +483,19 @@ struct RocalShardingInfo {
shard_size(shard_size) {}
};

/*! \brief Missing components behaviour for Webdataset
* \ingroup group_rocal_types
*/
enum RocalMissingComponentsBehaviour {
/*! \brief ROCAL_MISSING_COMPONENT_ERROR
*/
ROCAL_MISSING_COMPONENT_ERROR = 0,
/*! \brief ROCAL_MISSING_COMPONENT_SKIP
*/
ROCAL_MISSING_COMPONENT_SKIP = 1,
/*! \brief ROCAL_MISSING_COMPONENT_EMPTY
*/
ROCAL_MISSING_COMPONENT_EMPTY = 2
};

#endif // MIVISIONX_ROCAL_API_TYPES_H
79 changes: 79 additions & 0 deletions rocAL/include/helpers/tar_helper_functions.h
Original file line number Diff line number Diff line change
@@ -0,0 +1,79 @@
/*
Copyright (c) 2024 Advanced Micro Devices, Inc. All rights reserved.
Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in
all copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
THE SOFTWARE.
*/

#ifdef ENABLE_WDS
#include <fstream>
#include <memory>
#include <string>
#include <libtar.h>

constexpr size_t kBlockSize = T_BLOCKSIZE;
// Refactor TarArchive to use std::ifstream
class TarArchive {
public:
TarArchive() = default;
explicit TarArchive(std::unique_ptr<std::ifstream> stream);
TarArchive(TarArchive &&);
~TarArchive();
TarArchive &operator=(TarArchive &&);
bool advance_to_next_file_in_tar();
bool at_end_of_archive() const;
void seek_to_offset_in_archive(int64_t offset);
int64_t get_current_archive_offset() const;
int64_t get_current_header_size() const;
std::ifstream* get_stream();

enum EntryType {
ENTRY_NONE = 0,
ENTRY_FILE,
ENTRY_DIR,
ENTRY_HARDLINK,
ENTRY_SYMLINK,
ENTRY_CHARDEV,
ENTRY_BLOCKDEV,
ENTRY_FIFO,
ENTRY_NOT_DEFINED
};

const std::string& get_current_file_name() const;
size_t get_current_file_size() const;
EntryType get_current_file_type() const;
std::shared_ptr<void> read_current_file();
size_t read_into_buffer(void *buffer, size_t count);
bool is_end_of_file() const;
std::unique_ptr<std::ifstream> release_file_stream();

private:
std::unique_ptr<std::ifstream> _stream; // Using std::ifstream directly
int _instance_handle = -1;
void* _handle = nullptr;
friend ssize_t read_tar_archive(int, void *, size_t);
std::string _filename;
size_t _filesize = 0;
EntryType _filetype = ENTRY_NONE;
size_t _readoffset = 0;
int64_t _current_header = 0;
bool _eof = true;
void mark_end_of_file();
void parse_current_header();
};
#endif
5 changes: 3 additions & 2 deletions rocAL/include/loaders/image/node_image_loader.h
Original file line number Diff line number Diff line change
Expand Up @@ -39,8 +39,9 @@ class ImageLoaderNode : public Node {
/// \param load_batch_count Defines the quantum count of the images to be loaded. It's usually equal to the user's batch size.
/// The loader will repeat images if necessary to be able to have images in multiples of the load_batch_count,
/// for example if there are 10 images in the dataset and load_batch_count is 3, the loader repeats 2 images as if there are 12 images available.
void init(unsigned internal_shard_count, unsigned cpu_num_threads, const std::string &source_path, const std::string &json_path, const std::map<std::string, std::string> feature_key_map, StorageType storage_type, DecoderType decoder_type, bool shuffle, bool loop,
size_t load_batch_count, RocalMemType mem_type, std::shared_ptr<MetaDataReader> meta_data_reader, bool decoder_keep_orig = false, const ShardingInfo& sharding_info = ShardingInfo(), const char *prefix = "", unsigned sequence_length = 0, unsigned step = 0, unsigned stride = 0, ExternalSourceFileMode external_file_mode = ExternalSourceFileMode::NONE);
void init(unsigned internal_shard_count, unsigned cpu_num_threads, const std::string &source_path, const std::string &json_path, const std::map<std::string, std::string> feature_key_map, StorageType storage_type,
DecoderType decoder_type, bool shuffle, bool loop, size_t load_batch_count, RocalMemType mem_type, std::shared_ptr<MetaDataReader> meta_data_reader, bool decoder_keep_orig = false, const ShardingInfo& sharding_info = ShardingInfo(),
const char *prefix = "", unsigned sequence_length = 0, unsigned step = 0, unsigned stride = 0, ExternalSourceFileMode external_file_mode = ExternalSourceFileMode::NONE, const std::string &index_path = "");

std::shared_ptr<LoaderModule> get_loader_module();

Expand Down
6 changes: 3 additions & 3 deletions rocAL/include/loaders/image/node_image_loader_single_shard.h
Original file line number Diff line number Diff line change
Expand Up @@ -36,9 +36,9 @@ class ImageLoaderSingleShardNode : public Node {
/// \param load_batch_count Defines the quantum count of the images to be loaded. It's usually equal to the user's batch size.
/// The loader will repeat images if necessary to be able to have images in multiples of the load_batch_count,
/// for example if there are 10 images in the dataset and load_batch_count is 3, the loader repeats 2 images as if there are 12 images available.
void init(unsigned shard_id, unsigned shard_count, unsigned cpu_num_threads, const std::string &source_path, const std::string &json_path, StorageType storage_type, DecoderType decoder_type,
bool shuffle, bool loop, size_t load_batch_count, RocalMemType mem_type, std::shared_ptr<MetaDataReader> meta_data_reader, bool decoder_keep_orig = false, const ShardingInfo& sharding_info = ShardingInfo(),
const std::map<std::string, std::string> feature_key_map = std::map<std::string, std::string>(), unsigned sequence_length = 0, unsigned step = 0, unsigned stride = 0, ExternalSourceFileMode external_file_mode = ExternalSourceFileMode::NONE);
void init(unsigned shard_id, unsigned shard_count, unsigned cpu_num_threads, const std::string &source_path, const std::string &json_path, StorageType storage_type, DecoderType decoder_type,
bool shuffle, bool loop, size_t load_batch_count, RocalMemType mem_type, std::shared_ptr<MetaDataReader> meta_data_reader, bool decoder_keep_orig = false, const ShardingInfo& sharding_info = ShardingInfo(),
const std::map<std::string, std::string> feature_key_map = std::map<std::string, std::string>(), unsigned sequence_length = 0, unsigned step = 0, unsigned stride = 0, ExternalSourceFileMode external_file_mode = ExternalSourceFileMode::NONE, const std::string &index_path = "");

std::shared_ptr<LoaderModule> get_loader_module();

Expand Down
Loading

0 comments on commit a846299

Please sign in to comment.