Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Added full outer join #822

Merged
merged 5 commits into from
Jan 20, 2025
Merged

Added full outer join #822

merged 5 commits into from
Jan 20, 2025

Conversation

ilongin
Copy link
Contributor

@ilongin ilongin commented Jan 15, 2025

Adds full outer join support to our interface DataChain.merge(...) method and lower level DatasetQuery.join(...).
Up until now we only had left outer join and inner join.

Note that since lower SQLite versions (< 3.39.0) doesn't support full outer joins out of the box there is a workaround with 2 left joins + union.

Example:

merged  = dogs_cats.merge(cats, "file__path", full=True)

@ilongin ilongin marked this pull request as draft January 15, 2025 23:26
@ilongin ilongin linked an issue Jan 15, 2025 that may be closed by this pull request
Copy link

codecov bot commented Jan 16, 2025

Codecov Report

Attention: Patch coverage is 94.73684% with 1 line in your changes missing coverage. Please review.

Project coverage is 87.45%. Comparing base (1962a64) to head (31fc717).
Report is 3 commits behind head on main.

Files with missing lines Patch % Lines
src/datachain/data_storage/sqlite.py 93.75% 0 Missing and 1 partial ⚠️
Additional details and impacted files
@@           Coverage Diff           @@
##             main     #822   +/-   ##
=======================================
  Coverage   87.44%   87.45%           
=======================================
  Files         128      128           
  Lines       11368    11383   +15     
  Branches     1545     1550    +5     
=======================================
+ Hits         9941     9955   +14     
  Misses       1038     1038           
- Partials      389      390    +1     
Flag Coverage Δ
datachain 87.40% <94.73%> (+<0.01%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

Copy link

cloudflare-workers-and-pages bot commented Jan 16, 2025

Deploying datachain-documentation with  Cloudflare Pages  Cloudflare Pages

Latest commit: 31fc717
Status: ✅  Deploy successful!
Preview URL: https://e06b1d24.datachain-documentation.pages.dev
Branch Preview URL: https://ilongin-808-full-outer-join.datachain-documentation.pages.dev

View logs

@ilongin ilongin marked this pull request as ready for review January 16, 2025 01:16
@ilongin ilongin requested a review from shcheklein January 16, 2025 08:54
else:
ch = ch1.merge(ch2, "emp.person.name", "team.player", full=True)

str_default = String.default_value(test_session.catalog.warehouse.db.dialect)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

are those value differ across implementations?

what happens then if I pull a dataset from Studio to sqlite?

Copy link
Contributor Author

@ilongin ilongin Jan 17, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, those are different. Actually, right now I'm working on a task to fix this as issue is on CH side https://github.com/iterative/studio/issues/11161
If you pull dataset from studio, instead of NULL (or None in the code) it will have "" for string, 0 for int etc. As we discussed, non nullable columns were added in CH for performance reasons from the beginning of CH implementation.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see, yes, it's related to that CH nulls issue.

Besides that ... I wonder how it will work for cases where there is a column with some default values defined via Pydantic model 🤔 . Probably. we'll need a way to specify default values in the operation itself.

@ilongin ilongin requested a review from shcheklein January 17, 2025 22:39
@ilongin ilongin merged commit c6faa5f into main Jan 20, 2025
38 checks passed
@ilongin ilongin deleted the ilongin/808-full-outer-join branch January 20, 2025 02:28
dreadatour pushed a commit that referenced this pull request Jan 27, 2025
* added main logic for outer join

* fixing filters

* removign datasetquery tests and added more datachain unit tests
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add full outer join support
2 participants