Skip to content
This repository has been archived by the owner on Sep 16, 2024. It is now read-only.

Index generic over hash function #75

Closed
kirillt opened this issue Jan 28, 2024 · 8 comments · Fixed by #90
Closed

Index generic over hash function #75

kirillt opened this issue Jan 28, 2024 · 8 comments · Fixed by #90
Labels
enhancement New feature or request

Comments

@kirillt
Copy link
Member

kirillt commented Jan 28, 2024

We can abstract ResourceIndex over the id type and, consequently, over hashing function.

#[derive(Eq, Ord, PartialEq, PartialOrd, Hash, Clone, Debug)]
pub struct IndexEntry<Id> {
    pub modified: SystemTime,
    pub id: Id,
}

#[derive(PartialEq, Clone, Debug)]
pub struct ResourceIndex<Id> {
    pub id2path: HashMap<Id, CanonicalPathBuf>,
    pub path2id: HashMap<CanonicalPathBuf, IndexEntry<Id>>,

    pub collisions: HashMap<Id, usize>,
    root: PathBuf,
}

This way, we can use cryptographic hash functions to test index in simpler cases, when no collisions are present. This can simplify development and debugging. We could also use fake hash functions for testing only collisions.

Cryptographic hash function can also be used for apps where safety is more important than performance. The fast hash functions will be used in experimental "fast mode", it's the most useful for file browser apps.

@kirillt kirillt added the enhancement New feature or request label Jan 28, 2024
@kirillt
Copy link
Member Author

kirillt commented Feb 6, 2024

We should also support two modes of operation for the library:

  • safe or strict using a cryptographic hash function
  • fast using a fast hash function (experimental mode)

@kirillt kirillt changed the title Unit tests generic over hash function Index generic over hash function Feb 6, 2024
@kirillt kirillt moved this to Todo in Development Feb 6, 2024
@tareknaser
Copy link
Collaborator

We could also support two modes of operation for the library:

* `safe` or `strict` using a cryptographic hash function

* `fast` using a fast hash function (experimental mode)

This is a very interesting idea
To achieve this, we can use features with conditional compilation

@tareknaser
Copy link
Collaborator

Maybe we can implement ResourceIdTrait as an alternative to abstracting ResourceIndex over the ID type directly.
It forces consistency across different ID implementations and simplifies handling different hashing algorithms.
Each ResourceId struct will implement ResourceIdTrait and has its own logic for calculating the hash (which can be of different types).

Check out genericResourceId branch for a draft implementation, including how ResourceIdCrc32 works with the trait.

@kirillt
Copy link
Member Author

kirillt commented Mar 6, 2024

@tareknaser When we finish #90 and move the code to ark-rust repo, is it a good idea to split "secure" and "fast" indexes into separate crates? This way we could create optimized index structure without counting collisions when we use secure hash functions.

I also imagine creating something like test-fs-index crate which would verify that fs-index-fast and fs-index-secure are semantically equivalent.

@tareknaser
Copy link
Collaborator

is it a good idea to split "secure" and "fast" indexes into separate crates?

I thought about this as well but couldn't find a straightforward solution. Ideally, I would prefer to decouple the index entirely from the resource ID computations, but this isn't feasible because we need to track collisions with non-cryptographic hash functions.

Once this PR is merged alongside #87, I plan to dedicate some time to exploring options for conditionally compiling the collision resolution functionality.

for example, we could do

#[derive(PartialEq, Clone, Debug)]
pub struct ResourceIndex {
    pub id2path: HashMap<ResourceId, CanonicalPathBuf>,
    pub path2id: HashMap<CanonicalPathBuf, IndexEntry>,

    #[cfg(feature = "non-cryptographic")]
    pub collisions: HashMap<ResourceId, usize>,
    root: PathBuf,
}

but I would dive deeper into it when we settle on the API for ResourceIndex first because it would require modifications all over the code.

@kirillt
Copy link
Member Author

kirillt commented Mar 11, 2024

@tareknaser I like how conditional compilation looks in your example.

    #[cfg(feature = "non-cryptographic")]
    pub collisions: HashMap<ResourceId, usize>,

If it'll have same ergonomics all over the crate, then we could stick to it. Single index implementation should be better. Although it might be a bit more complicated to maintain.

I also imagine creating something like test-fs-index crate which would verify that fs-index-fast and fs-index-secure are semantically equivalent.

This test-fs-index crate would depend on 2 instances of same crate then: 1 with non-cryptographic flag ON, and one with the flag OFF. Is it technically possible?

Ideally, I would prefer to decouple the index entirely from the resource ID computations, but this isn't feasible because we need to track collisions with non-cryptographic hash functions.

Not sure about this part. In fact, ResourceId should be separate from index, because we might have non-fs index later. Or even apps without index, especially when the resources are not persisted. Collisions tracker is updated during index construction and updates, are you sure ResourceId is coupled to it?

@kirillt
Copy link
Member Author

kirillt commented Mar 11, 2024

Actually, we already have such a crate: https://github.com/ARK-Builders/ark-rust/blob/main/data-resource

@tareknaser
Copy link
Collaborator

If it'll have same ergonomics all over the crate, then we could stick to it. Single index implementation should be better. Although it might be a bit more complicated to maintain.

Once the API in issue #91 is finalized, we'll likely simplify it. Additionally, some tests in index.rs might be better suited for integration testing within ark-rust using ark-cli crate. This way we validate the project's integrity between crates as well.

Simplifying the API and including only essential unit tests related to ResourceIndex would make maintenance easier if we add compilation features.

This test-fs-index crate would depend on 2 instances of same crate then: 1 with non-cryptographic flag ON, and one with the flag OFF. Is it technically possible?

It may not be feasible at the moment, but there's an ongoing discussion about it. See rust-lang/cargo#2980 (comment)

Maybe that could work as an integration test. We could create a simple shell script that uses ark-cli to interact with fs-index, running it twice in the CI environment: once with the feature enabled and once disabled.

are you sure ResourceId is coupled to it?

It's not coupled per se but if we're using a non-cryptographic ResourceId, collisions still need to be tracked.

@github-project-automation github-project-automation bot moved this from Todo to Done in Development Mar 22, 2024
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
enhancement New feature or request
Projects
Archived in project
Development

Successfully merging a pull request may close this issue.

2 participants