Last Batch Policy changes for file source reader (#182)

* Remove the parse_config
* Adding missed param in python unit tests
* Fix error - Too many open files
* Fix slice fill values copy issue
* Fix file list path in API dataloader
* Revert "Add Glob to CMakeLists.txt" This reverts commit 47263d9.
* Fix include headers for Audio files
* Fix copy data 2D
* Minor changes
* Pass decoded data info to load routine instead of separate vectors
* Update CHANGELOG.md
* Update CHANGELOG.md
* Change swap_handle_time variable name in loader
* Update the changelog.md
* Update ChangeLog.md
* Update ChangeLog.md
* Update CHANGELOG.md
* Formatting changes Add comments
* Update doxygen comments
* Move file source reader from readers/image to readers folder
* Update README and add doxygen description
* Update CMakeLists and README for audio test
* Update README for audio test
* Minor fix
* Fix build errors
* Fix Copy_Data_2d_ROI
* Fix merge from PR 2
* Minor changes shard_count argument name
* Rename set and get functions of data_info to decoded_data_info
* Fix shard_size and audio source evaluation
* Changes in file_source_reader - to minimize the I/O operations
* Changes in the variable name
* Changes in the variable names of the audio source evalution
* Use set instead of vector
* Minor bug fixes
* Minor fixes Remove changes to update the filenames vector incase of padding Fix the pipeline when pad is off and use idx in a continuous manner
* Fix drop policy without padding To skip and batch and start with the previously padded idx in last batch
* Fix pytorch iterator - PARTIAL policy
* Revert empty line removed in CMakeLists.txt
* Removed prefix original for audio vectors
* Fix PARTIAL Fix issue with batch size greater than dataset size
* Reduce overall time for audio source evalution
* Fix shard_size and stick to shard issue seen with convergence
* Resolve PR comments
* Add @params to all args in pytorch.py
* Fix build issue
* Minor changes in unit test
* Minor changes
* Change ROCAL instaces to rocAL in pytorch.py
* Resolve the PR comments
* Minor changes in decoders.py - Modify the comment for shard_size
* Fix shard_size
* Minor changes
* Changes in pipeline.py and decoders.py
* Address the PR comments
* Address Review comments
* Remove print statement
* Fix the count_items
* Make Sharding similar to DALI
* Fix issues with DROP policy by introducing a new vector for padding
* Minor fixes
* Comment out print statements
* Add changes for shard_size LBP testing
* Fix DROP Policy with shard_size > 0
* Fix Stick_to_Shard=False
* Fix PARTIAL policy and code clean up
* fix last_batch_padded size when shard_size > 0
* Fix Drop policy - we skip the dropped batch in the next epoch
* Fix single shard outputs
* Remove the commented code and fix the padding code in open()
* Remove div by num_shards in decoders.py
* Introduce Audio layouts
* Add layout changes for spectrogram
* Fix the unit tests - c++ & python
* Code clean up and formatting
* Minor code clean up
* code clean up in pytorch.py
* Add layout changes for spectrogram
* Pass layouts for MelFilterBank
* Fix ToDecibels Pass layouts for ToDecibels
* Fix Normalize
* Fix build issue
* Fix python unit test
* Minor fix
* Pass LBP to decoders instead of the Pipeline creation
* Update pipeline.py - Remove commented code
* Update pipeline.py - Remove commented out code
* Adding changes for spec layout changes
* Adding changes to MFB and normalize nodes
* Update node_slice.cpp
* Update node_slice.h
* Resolve PR comments
* Fix downmix failing case and resolve the issue with merge
* Fix issue with file_source_reader.cpp when file_list is not used
* Resolve PR comments - Sundar
* Fix file_source_reader.cpp
* Fix shuffle issues
* Adding comments to all if conditions
* Fix merge conflicts
* Resolving review comments
* Fix a minor warning in file source reader
* Resolving review comments
* LBP comments resolution
* Resolving review comments
* Formatting changes
* Resolving Final Set of PR comments
* Combine with OR condition
* Remove the pad_last_batch_repeated print statement from decoders.py
* Add shard_size and stick_to_shard variables in args
* Minor spelling fix
* Make changes to insert the padded data in the file_names vector
* Support to pass the variables fo lbp as struct
* Fix segmentation fault
* Resolve PR comments
* Resolve PR comments
* Resolve PR comments
* Use PreComputed start and end indices
* Use precomputed shard_idx start and end in initialize
* Initialize the Sharding info using ShardingInfo()
* convert the signed to int32_t type
* temp commit for struct changes
* Fix the struct changes - All the test cases passing
* Remove any print statements
* Add support to Pass the decode size policy from the user
* Add support to Pass the decode size policy from the user
* Rename RocalShardingInfo for ShardingInfo and vice-versa
* xywh roi copy
* Fix decoders.py for image decoders
* Make stick_to_shard True by default
* Minor changes to the copy_data function
* Rename to x_offset and y_offset in copy_data
* Minor changes - remove unused variables
* Minor change - Variable names
* Update Doxygen comments and comments of API
* Make the rocalShardingInfo as the last param for Audio loaders
* Remove unused variables and functions in file_source_reader.cpp & .h files
* Remove the doctring explanation for unused params
* Change the explanation according to the newly introduced structure

---------

Co-authored-by: SundarRajan28 <[email protected]>
Co-authored-by: fiona-gladwin <[email protected]>
Co-authored-by: Swetha B S <[email protected]>
Co-authored-by: root <[email protected]>
Co-authored-by: SundarRajan98 <[email protected]>
Co-authored-by: sbavasab <[email protected]>
Co-authored-by: Lakshmi Kumar <[email protected]>
Co-authored-by: Kiriti Gowda <[email protected]>
Co-authored-by: Sundar Rajan Vaithiyanathan <[email protected]>
Co-authored-by: Swetha B S <[email protected]>
Co-authored-by: fgladwin <[email protected]>
Co-authored-by: Swetha B S <[email protected]>
ROCm · Sep 11, 2024 · 87348ad
1 parent afdffd7
Showing 38 changed files with 677 additions and 413 deletions.
diff --git a/rocAL/include/api/rocal_api_data_loaders.h b/rocAL/include/api/rocal_api_data_loaders.h
diff --git a/rocAL/include/api/rocal_api_info.h b/rocAL/include/api/rocal_api_info.h
@@ -133,7 +133,7 @@ extern "C" TimingInfo ROCAL_API_CALL rocalGetTimingInfo(RocalContext rocal_conte
  * \brief Retrieves the information about the size of the last batch.
  * \ingroup group_rocal_info
  * \param rocal_context
- * \return The number of samples that were padded in the last batch in adherence with last_batch_policy and last_batch_padded
+ * \return The number of samples that were padded in the last batch in adherence with last_batch_policy and pad_last_batch_repeated.
  */
 extern "C" size_t ROCAL_API_CALL rocalGetLastBatchPaddedSize(RocalContext rocal_context);
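As a rough illustration of the value this call reports, the sketch below computes how many samples are needed to round a dataset up to a whole number of batches. The helper name `padded_samples_in_last_batch` is hypothetical and is not part of the rocAL API; it also ignores sharding and the DROP policy, so treat it as arithmetic under those assumptions rather than the library's logic.

```cpp
#include <cstddef>

// Hypothetical helper (not a rocAL function): number of samples that must be
// added so that dataset_size becomes a multiple of batch_size.
// Assumes a single shard and a policy that pads the last batch.
std::size_t padded_samples_in_last_batch(std::size_t dataset_size, std::size_t batch_size) {
    if (batch_size == 0) return 0;  // nothing sensible to pad against
    const std::size_t remainder = dataset_size % batch_size;
    return remainder == 0 ? 0 : batch_size - remainder;
}
```

With 10 samples and a batch size of 3, the case used in the loader comments in this commit, the helper returns 2.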
 

diff --git a/rocAL/include/api/rocal_api_tensor.h b/rocAL/include/api/rocal_api_tensor.h
@@ -40,6 +40,7 @@ class rocalTensor {
     virtual ~rocalTensor() = default;
     virtual void* buffer() = 0;
     virtual unsigned copy_data(void* user_buffer, RocalOutputMemType external_mem_type = ROCAL_MEMCPY_HOST) = 0;
+    virtual unsigned copy_data(void* user_buffer, uint x_offset, uint y_offset, uint max_cols, uint max_rows) = 0; // Copy only the ROI to the user_buffer [The padded region is not copied]
     virtual unsigned num_of_dims() = 0;
     virtual unsigned batch_size() = 0;
     virtual std::vector<size_t> dims() = 0;
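The new `copy_data` overload copies only a region of interest and skips the padded area. A minimal sketch of that idea for a single row-major 2D buffer follows; `copy_roi_2d` and its `src_cols` parameter are hypothetical, and the real method works on a batched tensor, so this is not the rocAL implementation. Note that this diff names the trailing parameters `max_cols, max_rows` in `rocal_api_tensor.h` but `max_rows, max_cols` in `pipeline/tensor.h`; the sketch follows the former order.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical helper (not rocAL code): copy a roi_rows x roi_cols window that
// starts at column x_offset, row y_offset out of a row-major source buffer
// whose rows are src_cols elements wide. The caller must ensure the window
// lies inside the source buffer.
std::vector<float> copy_roi_2d(const std::vector<float>& src, std::size_t src_cols,
                               std::size_t x_offset, std::size_t y_offset,
                               std::size_t roi_cols, std::size_t roi_rows) {
    std::vector<float> dst(roi_rows * roi_cols);
    for (std::size_t r = 0; r < roi_rows; ++r) {
        // Start of the ROI on source row (y_offset + r).
        const float* src_row = src.data() + (y_offset + r) * src_cols + x_offset;
        // Destination rows are packed densely, without the source padding.
        std::copy_n(src_row, roi_cols, dst.begin() + r * roi_cols);
    }
    return dst;
}
```

For a 3x4 source holding 0..11 and a 2x2 window at offset (1, 1), the result is 5, 6, 9, 10.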

diff --git a/rocAL/include/api/rocal_api_types.h b/rocAL/include/api/rocal_api_types.h
@@ -438,7 +438,7 @@ enum RocalMelScaleFormula {
     ROCAL_MELSCALE_HTK
 };
 
-/*! \brief Tensor Last Batch Policies
+/*! \brief Tensor Last Batch Policy Type enum
  *  \ingroup group_rocal_types
  */
 enum RocalLastBatchPolicy {
@@ -448,9 +448,27 @@ enum RocalLastBatchPolicy {
     /*! \brief ROCAL_LAST_BATCH_DROP - The last batch is dropped if there are not enough samples from the current epoch.
      */
     ROCAL_LAST_BATCH_DROP = 1,
-    /*! \brief ROCAL_LAST_BATCH_PARTIAL - The last batch is partially filled with the remaining data from the current epoch, keeping the rest of the samples empty. (currently this policy works similar to FILL in rocAL, PARTIAL policy needs to be handled from python end)
+    /*! \brief ROCAL_LAST_BATCH_PARTIAL - The last batch is partially filled with the remaining data from the current epoch, keeping the rest of the samples empty. (currently this policy works similar to FILL in rocAL, PARTIAL policy needs to be handled in the python iterator)
      */
     ROCAL_LAST_BATCH_PARTIAL = 2
 };
 
+/*! \brief rocAL RocalShardingInfo struct
+ * \ingroup group_rocal_types
+ */
+struct RocalShardingInfo {
+    RocalLastBatchPolicy last_batch_policy;
+    bool pad_last_batch_repeated;
+    bool stick_to_shard;
+    int32_t shard_size;
+
+    // Constructor with default values
+    RocalShardingInfo()
+        : last_batch_policy(RocalLastBatchPolicy::ROCAL_LAST_BATCH_FILL),
+          pad_last_batch_repeated(false),
+          stick_to_shard(true),
+          shard_size(-1)
+    {}
+};
+
 #endif  // MIVISIONX_ROCAL_API_TYPES_H
diff --git a/rocAL/include/loaders/audio/audio_loader.h b/rocAL/include/loaders/audio/audio_loader.h
@@ -56,6 +56,7 @@ class AudioLoader : public LoaderModule {
     void feed_external_input(const std::vector<std::string>& input_images_names, const std::vector<unsigned char*>& input_buffer,
                              const std::vector<ROIxywh>& roi_xywh, unsigned int max_width, unsigned int max_height, unsigned int channels,
                              ExternalSourceFileMode mode, bool eos) override { THROW("external source feed is not supported in audio loader") }
+    size_t last_batch_padded_size() override;
 
    private:
     bool is_out_of_data();

diff --git a/rocAL/include/loaders/audio/audio_loader_sharded.h b/rocAL/include/loaders/audio/audio_loader_sharded.h
@@ -45,6 +45,7 @@ class AudioLoaderSharded : public LoaderModule {
     void feed_external_input(const std::vector<std::string>& input_images_names, const std::vector<unsigned char*>& input_buffer,
                              const std::vector<ROIxywh>& roi_xywh, unsigned int max_width, unsigned int max_height, unsigned int channels, 
                              ExternalSourceFileMode mode, bool eos) override { THROW("external source feed is not supported in audio loader") }
+    size_t last_batch_padded_size() override;
 
    private:
     void increment_loader_idx();

diff --git a/rocAL/include/loaders/audio/audio_read_and_decode.h b/rocAL/include/loaders/audio/audio_read_and_decode.h
@@ -64,6 +64,7 @@ class AudioReadAndDecode {
         const size_t max_decoded_channels);
     // returns timing info or other status information
     Timing GetTiming();
+    size_t last_batch_padded_size(); // The number of padded samples in the last batch
 
    private:
     std::vector<std::shared_ptr<AudioDecoder>> _decoder;

diff --git a/rocAL/include/loaders/audio/node_audio_loader.h b/rocAL/include/loaders/audio/node_audio_loader.h
@@ -44,11 +44,13 @@ class AudioLoaderNode : public Node {
     /// \param load_batch_count Defines the quantum count of the Audios to be loaded. It's usually equal to the user's batch size.
     /// \param mem_type Memory type, host or device
     /// \param meta_data_reader Determines the meta-data information
+    /// \param sharding_info The members of ShardingInfo determines how the data is distributed among the shards and how the last batch is processed by the pipeline.
     /// The loader will repeat Audios if necessary to be able to have Audios in multiples of the load_batch_count,
     /// for example if there are 10 Audios in the dataset and load_batch_count is 3, the loader repeats 2 Audios as if there are 12 Audios available.
     void Init(unsigned internal_shard_count, unsigned cpu_num_threads, const std::string &source_path,
               const std::string &file_list_path, StorageType storage_type, DecoderType decoder_type, bool shuffle, bool loop,
-              size_t load_batch_count, RocalMemType mem_type, std::shared_ptr<MetaDataReader> meta_data_reader);
+              size_t load_batch_count, RocalMemType mem_type, std::shared_ptr<MetaDataReader> meta_data_reader,
+              const ShardingInfo& sharding_info);
     std::shared_ptr<LoaderModule> GetLoaderModule();
 
    protected:

diff --git a/rocAL/include/loaders/audio/node_audio_loader_single_shard.h b/rocAL/include/loaders/audio/node_audio_loader_single_shard.h
@@ -42,11 +42,13 @@ class AudioLoaderSingleShardNode : public Node {
     /// \param load_batch_count Defines the quantum count of the Audios to be loaded. It's usually equal to the user's batch size.
     /// \param mem_type Memory type, host or device
     /// \param meta_data_reader Determines the meta-data information
+    /// \param sharding_info The members of ShardingInfo determines how the data is distributed among the shards and how the last batch is processed by the pipeline.
     /// The loader will repeat Audios if necessary to be able to have Audios in multiples of the load_batch_count,
     /// for example if there are 10 Audios in the dataset and load_batch_count is 3, the loader repeats 2 Audios as if there are 12 Audios available.
     void Init(unsigned shard_id, unsigned shard_count, unsigned cpu_num_threads, const std::string &source_path,
               const std::string &file_list_path, StorageType storage_type, DecoderType decoder_type, bool shuffle,
-              bool loop, size_t load_batch_count, RocalMemType mem_type, std::shared_ptr<MetaDataReader> meta_data_reader);
+              bool loop, size_t load_batch_count, RocalMemType mem_type, std::shared_ptr<MetaDataReader> meta_data_reader,
+              const ShardingInfo& sharding_info);
     std::shared_ptr<LoaderModule> GetLoaderModule();
 
    protected:

diff --git a/rocAL/include/loaders/image/node_fused_jpeg_crop.h b/rocAL/include/loaders/image/node_fused_jpeg_crop.h
@@ -42,7 +42,7 @@ class FusedJpegCropNode : public Node {
     /// for example if there are 10 images in the dataset and load_batch_count is 3, the loader repeats 2 images as if there are 12 images available.
     void init(unsigned internal_shard_count, unsigned cpu_num_threads, const std::string &source_path, const std::string &json_path, StorageType storage_type,
               DecoderType decoder_type, bool shuffle, bool loop, size_t load_batch_count, RocalMemType mem_type, std::shared_ptr<MetaDataReader> meta_data_reader,
-              unsigned num_attempts, std::vector<float> &random_area, std::vector<float> &random_aspect_ratio, std::pair<RocalBatchPolicy, bool> last_batch_info = {RocalBatchPolicy::FILL, true});
+              unsigned num_attempts, std::vector<float> &random_area, std::vector<float> &random_aspect_ratio, const ShardingInfo& sharding_info = ShardingInfo());
 
     std::shared_ptr<LoaderModule> get_loader_module();
 

diff --git a/rocAL/include/loaders/image/node_fused_jpeg_crop_single_shard.h b/rocAL/include/loaders/image/node_fused_jpeg_crop_single_shard.h
@@ -39,7 +39,7 @@ class FusedJpegCropSingleShardNode : public Node {
     /// for example if there are 10 images in the dataset and load_batch_count is 3, the loader repeats 2 images as if there are 12 images available.
     void init(unsigned shard_id, unsigned shard_count, unsigned cpu_num_threads, const std::string &source_path, const std::string &json_path, StorageType storage_type,
               DecoderType decoder_type, bool shuffle, bool loop, size_t load_batch_count, RocalMemType mem_type, std::shared_ptr<MetaDataReader> meta_data_reader,
-              unsigned num_attempts, std::vector<float> &random_area, std::vector<float> &random_aspect_ratio, std::pair<RocalBatchPolicy, bool> last_batch_info = {RocalBatchPolicy::FILL, true});
+              unsigned num_attempts, std::vector<float> &random_area, std::vector<float> &random_aspect_ratio, const ShardingInfo& sharding_info = ShardingInfo());
 
     std::shared_ptr<LoaderModule> get_loader_module();
 

diff --git a/rocAL/include/loaders/image/node_image_loader.h b/rocAL/include/loaders/image/node_image_loader.h
@@ -40,7 +40,7 @@ class ImageLoaderNode : public Node {
     /// The loader will repeat images if necessary to be able to have images in multiples of the load_batch_count,
     /// for example if there are 10 images in the dataset and load_batch_count is 3, the loader repeats 2 images as if there are 12 images available.
     void init(unsigned internal_shard_count, unsigned cpu_num_threads, const std::string &source_path, const std::string &json_path, const std::map<std::string, std::string> feature_key_map, StorageType storage_type, DecoderType decoder_type, bool shuffle, bool loop,
-              size_t load_batch_count, RocalMemType mem_type, std::shared_ptr<MetaDataReader> meta_data_reader, bool decoder_keep_orig = false, std::pair<RocalBatchPolicy, bool> last_batch_info = {RocalBatchPolicy::FILL, true}, const char *prefix = "", unsigned sequence_length = 0, unsigned step = 0, unsigned stride = 0, ExternalSourceFileMode external_file_mode = ExternalSourceFileMode::NONE);
+              size_t load_batch_count, RocalMemType mem_type, std::shared_ptr<MetaDataReader> meta_data_reader, bool decoder_keep_orig = false, const ShardingInfo& sharding_info = ShardingInfo(), const char *prefix = "", unsigned sequence_length = 0, unsigned step = 0, unsigned stride = 0, ExternalSourceFileMode external_file_mode = ExternalSourceFileMode::NONE);
 
     std::shared_ptr<LoaderModule> get_loader_module();
 

diff --git a/rocAL/include/loaders/image/node_image_loader_single_shard.h b/rocAL/include/loaders/image/node_image_loader_single_shard.h
@@ -37,7 +37,7 @@ class ImageLoaderSingleShardNode : public Node {
     /// The loader will repeat images if necessary to be able to have images in multiples of the load_batch_count,
     /// for example if there are 10 images in the dataset and load_batch_count is 3, the loader repeats 2 images as if there are 12 images available.
     void init(unsigned shard_id, unsigned shard_count, unsigned cpu_num_threads, const std::string &source_path, const std::string &json_path, StorageType storage_type, DecoderType decoder_type,
-              bool shuffle, bool loop, size_t load_batch_count, RocalMemType mem_type, std::shared_ptr<MetaDataReader> meta_data_reader, bool decoder_keep_orig = false, std::pair<RocalBatchPolicy, bool> last_batch_info = {RocalBatchPolicy::FILL, true},
+              bool shuffle, bool loop, size_t load_batch_count, RocalMemType mem_type, std::shared_ptr<MetaDataReader> meta_data_reader, bool decoder_keep_orig = false, const ShardingInfo& sharding_info = ShardingInfo(),
               const std::map<std::string, std::string> feature_key_map = std::map<std::string, std::string>(), unsigned sequence_length = 0, unsigned step = 0, unsigned stride = 0, ExternalSourceFileMode external_file_mode = ExternalSourceFileMode::NONE);
 
     std::shared_ptr<LoaderModule> get_loader_module();

diff --git a/rocAL/include/meta_data/meta_data_reader.h b/rocAL/include/meta_data/meta_data_reader.h
@@ -100,4 +100,5 @@ class MetaDataReader {
     virtual ImgSize lookup_image_size(const std::string& image_name) { return {}; }
     virtual void set_aspect_ratio_grouping(bool aspect_ratio_grouping) { return; }
     virtual bool get_aspect_ratio_grouping() const { return {}; }
+    virtual std::vector<std::string> get_relative_file_path() { return {}; } // Returns the relative file_path's of the reader 
 };
diff --git a/rocAL/include/meta_data/text_file_meta_data_reader.h b/rocAL/include/meta_data/text_file_meta_data_reader.h
@@ -36,6 +36,7 @@ class TextFileMetaDataReader : public MetaDataReader {
     bool set_timestamp_mode() override { return false; }
 
     const std::map<std::string, std::shared_ptr<MetaData>>& get_map_content() override { return _map_content; }
+    std::vector<std::string> get_relative_file_path() override { return _relative_file_path; }
     TextFileMetaDataReader();
 
    private:
@@ -45,4 +46,5 @@ class TextFileMetaDataReader : public MetaDataReader {
     void add(std::string image_name, int label);
     std::map<std::string, std::shared_ptr<MetaData>> _map_content;
     std::string _path;
+    std::vector<std::string> _relative_file_path {};
 };
diff --git a/rocAL/include/pipeline/commons.h b/rocAL/include/pipeline/commons.h
@@ -159,11 +159,11 @@ struct Timing {
     long long unsigned video_process_time= 0;
 };
 
-/*! \brief Tensor Last Batch Policies
+/*! \brief Tensor Last Batch Policy Type enum
 The last batch policies determine the behavior when there are not enough samples in the epoch to fill the last batch
         FILL - The last batch is filled by either repeating the last sample or by wrapping up the data set.
         DROP - The last batch is dropped if it cannot be fully filled with data from the current epoch.
-        PARTIAL - The last batch is partially filled with the remaining data from the current epoch, and padding the remaining samples with either last image or wrapping up the dataset - the padded images are removed in the python end
+        PARTIAL - The last batch is partially filled with the remaining data from the current epoch, keeping the rest of the samples empty. (currently this policy works similar to FILL in rocAL, PARTIAL policy needs to be handled in the pytorch iterator)
  */
 enum RocalBatchPolicy {
     FILL = 0,
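The three policies described in the comment above can be compared with a small sketch that reports how many batches an epoch yields and how many real samples the final batch holds. The names `SketchPolicy`, `EpochPlan` and `plan_epoch` are hypothetical, and the code assumes a single shard; it follows the comment's description, in which PARTIAL behaves like FILL inside rocAL and the iterator is expected to trim the padding.

```cpp
#include <cstddef>

// Hypothetical stand-in for the policy enum; not the rocAL type.
enum class SketchPolicy { Fill, Drop, Partial };

struct EpochPlan {
    std::size_t batches;              // batches emitted in one epoch
    std::size_t valid_in_last_batch;  // real (non-padded) samples in the final batch
};

// Assumes one shard holding dataset_size samples.
EpochPlan plan_epoch(std::size_t dataset_size, std::size_t batch_size, SketchPolicy policy) {
    if (dataset_size == 0 || batch_size == 0) return {0, 0};
    const std::size_t full_batches = dataset_size / batch_size;
    const std::size_t remainder = dataset_size % batch_size;
    if (remainder == 0) return {full_batches, batch_size};
    if (policy == SketchPolicy::Drop) {
        // The incomplete trailing batch is discarded.
        return {full_batches, full_batches > 0 ? batch_size : 0};
    }
    // Fill and Partial both emit the trailing batch; only `remainder` of its
    // samples are real, the rest are padding (kept for Fill, trimmed by the
    // iterator for Partial).
    return {full_batches + 1, remainder};
}
```

For 10 samples and a batch size of 3, Fill and Partial give 4 batches with 1 real sample in the last one, while Drop gives 3 full batches.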

diff --git a/rocAL/include/pipeline/tensor.h b/rocAL/include/pipeline/tensor.h
@@ -326,7 +326,7 @@ class Tensor : public rocalTensor {
 #endif
     unsigned copy_data(void* user_buffer, RocalOutputMemType external_mem_type) override;
     //! Copying the output buffer with specified max_cols and max_rows values for the 2D buffer of size batch_size
-    unsigned copy_data(void* user_buffer, uint max_rows, uint max_cols); 
+    unsigned copy_data(void* user_buffer, uint x_offset, uint y_offset, uint max_rows, uint max_cols); 
     //! Default destructor
     /*! Releases the OpenVX Tensor object */
     ~Tensor();

diff --git a/rocAL/include/readers/file_source_reader.h b/rocAL/include/readers/file_source_reader.h
@@ -28,8 +28,8 @@ THE SOFTWARE.
 #include <vector>
 
 #include "pipeline/commons.h"
-#include "readers/image/image_reader.h"
 #include "pipeline/timing_debug.h"
+#include "readers/image/image_reader.h"
 
 class FileSourceReader : public Reader {
    public:
@@ -67,9 +67,11 @@ class FileSourceReader : public Reader {
 
     FileSourceReader();
 
-    //! Returns the number of images in the last batch
-    size_t last_batch_padded_size() override;
+    size_t last_batch_padded_size() override;  // The size of the number of samples padded in the last batch
+
+    std::string get_root_folder_path() override;  // Returns the root folder path
 
+    std::vector<std::string> get_file_paths_from_meta_data_reader() override;  // Returns the relative file path from the meta-data reader
    private:
     //! opens the folder containing the images
     Reader::Status open_folder();
@@ -83,30 +85,37 @@ class FileSourceReader : public Reader {
     unsigned _curr_file_idx;
     FILE *_current_fPtr;
     unsigned _current_file_size;
+    unsigned _shard_start_idx;
+    std::vector<unsigned> _shard_start_idx_vector, _shard_end_idx_vector;
     std::string _last_id;
-    std::string _last_file_name, _last_file_path;
+    std::string _last_file_name, _last_file_path, _absolute_file_path;
     size_t _shard_id = 0;
     size_t _shard_count = 1;  // equivalent of batch size
-    //!< _batch_count Defines the quantum count of the images to be read. It's usually equal to the user's batch size.
-    /// The loader will repeat images if necessary to be able to have images available in multiples of the load_batch_count,
-    /// for instance if there are 10 images in the dataset and _batch_count is 3, the loader repeats 2 images as if there are 12 images available.
-    size_t _batch_count = 1;
-    size_t _file_id = 0;
-    size_t _in_batch_read_count = 0;
+    int32_t _shard_size = -1;
+    size_t _batch_size = 1;
+    size_t _padded_samples = 0;
     bool _loop;
     bool _shuffle;
     int _read_counter = 0;
     //!< _file_count_all_shards total_number of files in to figure out the max_batch_size (usually needed for distributed training).
     size_t _file_count_all_shards;
     void incremenet_read_ptr();
+    void increment_curr_file_idx();
     int release();
-    size_t get_file_shard_id();
-    void incremenet_file_id() { _file_id++; }
     void fill_last_batch();
     void replicate_last_batch_to_pad_partial_shard();
     std::shared_ptr<MetaDataReader> _meta_data_reader = nullptr;
-    //! Pair containing the last batch policy and last_batch_padded values for deciding what to do with last batch
-    std::pair<RocalBatchPolicy, bool> _last_batch_info;
-    size_t _last_batch_padded_size = 0;
-    Reader::Status generate_file_names();
+    //! Pair containing the last batch policy and pad_last_batch_repeated values for deciding what to do with last batch
+    ShardingInfo _last_batch_info = ShardingInfo();  // The members of ShardingInfo determines how the data is distributed among the shards and how the last batch is processed by the pipeline.
+    size_t _last_batch_padded_size = 0;              // The size of number of padded samples in the last batch
+    size_t _num_padded_samples = 0;                  //! Number of samples that are padded in the last batch which would differ for each shard.
+    bool _stick_to_shard = false;
+    bool _pad_last_batch_repeated = false;
+    Reader::Status generate_file_names();         // Function that would generate _file_names containing all the samples in the dataset
+    void compute_start_and_end_idx_of_all_shards();     // Start Idx of all the Shards
+    size_t get_dataset_size();                    // DataSet Size
+    size_t actual_shard_size_without_padding();   // Actual Number of Files present in the shard (without padding)
+    size_t largest_shard_size_without_padding();  // The size of the shard having largest files (without padding)
+    //!< Used to advance to the next shard's data to increase the entropy of the data seen by the pipeline>
+    void increment_shard_id();
 };
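The header only declares `compute_start_and_end_idx_of_all_shards()` and `largest_shard_size_without_padding()`; their bodies are not part of this excerpt. The sketch below shows one way such indices could be precomputed, assuming the DALI-style split that the commit message alludes to ("Make Sharding similar to DALI"). The free functions `shard_ranges` and `largest_shard_size` are hypothetical names and may differ from what `file_source_reader.cpp` actually does.

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical sketch (not the rocAL implementation): half-open [start, end)
// index ranges for every shard, using
//   start = floor(dataset_size * shard_id / shard_count)
//   end   = floor(dataset_size * (shard_id + 1) / shard_count)
std::vector<std::pair<std::size_t, std::size_t>> shard_ranges(std::size_t dataset_size,
                                                              std::size_t shard_count) {
    std::vector<std::pair<std::size_t, std::size_t>> ranges;
    ranges.reserve(shard_count);
    for (std::size_t shard_id = 0; shard_id < shard_count; ++shard_id) {
        ranges.emplace_back(dataset_size * shard_id / shard_count,
                            dataset_size * (shard_id + 1) / shard_count);
    }
    return ranges;
}

// Size of the largest shard before any padding: ceil(dataset_size / shard_count).
// Assumes shard_count > 0.
std::size_t largest_shard_size(std::size_t dataset_size, std::size_t shard_count) {
    return (dataset_size + shard_count - 1) / shard_count;
}
```

For 10 samples split over 3 shards this yields the ranges [0, 3), [3, 6) and [6, 10), with a largest shard size of 4; shards smaller than that are the ones that would need padding.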