Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

count() on file stream returns different result than view on it #469

Closed
gangtao opened this issue Jan 5, 2024 · 1 comment · Fixed by #478
Closed

count() on file stream returns different result than view on it #469

gangtao opened this issue Jan 5, 2024 · 1 comment · Fixed by #478
Assignees
Labels
bug Something isn't working

Comments

@gangtao
Copy link
Contributor

gangtao commented Jan 5, 2024

I am running following query to count the event on a file stream

SELECT 
  count(*)
FROM 
  file('/var/lib/proton/user_files/EQY_US_ALL_TRADE_20231003.gz', 'CSV', 'Time string, Exchange string, Symbol string, Sale_Condition string, Trade_Volume uint32, Trade_Price decimal(20,6), Trade_Stop_Stock_Indicator string, Trade_Correction_Indicator string, Sequence_Number uint32, Trade_Id uint64, Source_of_Trade string, Trade_Reporting_Facility string, Participant_Timestamp string, Trade_Reporting_Facility_TRF_Timestamp string, Trade_Through_Exempt_Indicator uint8')
SETTINGS 
  format_csv_delimiter = '|'

Query id: 7081aea4-765c-4ec7-b6f5-3be4b6198ee6

┌─count()─┐
│ 3008814 │
└─────────┘
┌─count()─┐
│ 5821401 │
└─────────┘
┌─count()─┐
│ 8633988 │
└─────────┘
┌──count()─┐
│ 11446575 │
└──────────┘
┌──count()─┐
│ 14193753 │
└──────────┘
┌──count()─┐
│ 17006340 │
└──────────┘
┌──count()─┐
│ 19753518 │
└──────────┘
┌──count()─┐
│ 22566105 │
└──────────┘
┌──count()─┐
│ 25378692 │
└──────────┘
┌──count()─┐
│ 28191279 │
└──────────┘
┌──count()─┐
│ 31003866 │
└──────────┘
┌──count()─┐
│ 33816453 │
└──────────┘
┌──count()─┐
│ 36694449 │
└──────────┘
┌──count()─┐
│ 39637854 │
└──────────┘
┌──count()─┐
│ 42450441 │
└──────────┘
┌──count()─┐
│ 45263028 │
└──────────┘
┌──count()─┐
│ 48075615 │
└──────────┘
┌──count()─┐
│ 50888202 │
└──────────┘
┌──count()─┐
│ 53700789 │
└──────────┘
┌──count()─┐
│ 56513376 │
└──────────┘
┌──count()─┐
│ 59325963 │
└──────────┘
┌──count()─┐
│ 62138550 │
└──────────┘
┌──count()─┐
│ 64951137 │
└──────────┘
┌──count()─┐
│ 67763724 │
└──────────┘
┌──count()─┐
│ 70576311 │
└──────────┘
┌──count()─┐
│ 73388898 │
└──────────┘

26 rows in set. Elapsed: 54.436 sec. Processed 75.81 million rows, 12.74 GB (1.39 million rows/s., 234.07 MB/s.)

then I create a view on that file stream query

WITH raw AS
  (
    SELECT 
      *
    FROM 
      file('/var/lib/proton/user_files/EQY_US_ALL_TRADE_20231003.gz', 'CSV', 'Time string, Exchange string, Symbol string, Sale_Condition string, Trade_Volume uint32, Trade_Price decimal(20,6), Trade_Stop_Stock_Indicator string, Trade_Correction_Indicator string, Sequence_Number uint32, Trade_Id uint64, Source_of_Trade string, Trade_Reporting_Facility string, Participant_Timestamp string, Trade_Reporting_Facility_TRF_Timestamp string, Trade_Through_Exempt_Indicator uint8')
    SETTINGS 
      format_csv_delimiter = '|'
  ),
raw_tp_time as
(
SELECT 
  *, extract_all_groups(Time, '^(\\d{2})(\\d{2})(\\d{2})(\\d{9})') AS g, to_datetime64(concat('2023-10-03T', (g[1])[1], ':', (g[1])[2], ':', (g[1])[3], '.', (g[1])[4]), 9) AS _tp_time
FROM 
  raw
) select * except g from raw_tp_time

this view just extract timestamp from the file stream

SELECT 
  count(*)
FROM 
  v_nyse_trade_20231003

Query id: a29c6d45-cf14-4d48-b21b-ca0fd7b6a4aa

┌─count()─┐
│ 3008814 │
└─────────┘
┌─count()─┐
│ 6017628 │
└─────────┘
┌─count()─┐
│ 9026442 │
└─────────┘
┌──count()─┐
│ 12035256 │
└──────────┘
┌──count()─┐
│ 15044070 │
└──────────┘
┌──count()─┐
│ 18052884 │
└──────────┘
┌──count()─┐
│ 21061698 │
└──────────┘
┌──count()─┐
│ 24070512 │
└──────────┘
┌──count()─┐
│ 27079326 │
└──────────┘
┌──count()─┐
│ 30088140 │
└──────────┘
┌──count()─┐
│ 33162363 │
└──────────┘
┌──count()─┐
│ 36171177 │
└──────────┘
┌──count()─┐
│ 39179991 │
└──────────┘
┌──count()─┐
│ 42188805 │
└──────────┘
┌──count()─┐
│ 45197619 │
└──────────┘
┌──count()─┐
│ 48206433 │
└──────────┘
┌──count()─┐
│ 51280656 │
└──────────┘
┌──count()─┐
│ 54354879 │
└──────────┘
┌──count()─┐
│ 57298284 │
└──────────┘
┌──count()─┐
│ 60307098 │
└──────────┘
┌──count()─┐
│ 63315912 │
└──────────┘
┌──count()─┐
│ 66324726 │
└──────────┘
┌──count()─┐
│ 69333540 │
└──────────┘
┌──count()─┐
│ 72342354 │
└──────────┘
┌──count()─┐
│ 75351168 │
└──────────┘

25 rows in set. Elapsed: 50.767 sec. Processed 75.81 million rows, 12.74 GB (1.49 million rows/s., 250.99 MB/s.)

the second query has more result than the first one, I beleive the view did not cause new rows. so one of the count or both must be wrong.

also I tried to insert the query from the view into a real stream.

INSERT INTO nyse_trade_20231003 SELECT 
  *
FROM 
  v_nyse_trade_20231003

and then count what is in that stream, the count is also different than the previsous two

SELECT 
  count(*)
FROM 
  table(nyse_trade_20231003)

Query id: 7cb51827-b7f4-4348-826c-7c6bc9916331

┌──count()─┐
│ 75811661 │
└──────────┘

1 row in set. Elapsed: 0.128 sec. Processed 75.81 million rows, 75.81 MB (592.76 million rows/s., 592.76 MB/s.)
@gangtao gangtao added the bug Something isn't working label Jan 5, 2024
@gangtao
Copy link
Contributor Author

gangtao commented Jan 5, 2024

I re inset the stream again, and got same count

select count(*) from table(nyse_trade_20231003)

SELECT 
  count(*)
FROM 
  table(nyse_trade_20231003)

Query id: 222e9b4f-dc8d-490c-bc50-82069654ab7f

┌──count()─┐
│ 75811661 │
└──────────┘

1 row in set. Elapsed: 0.059 sec. Processed 75.81 million rows, 75.81 MB (1.28 billion rows/s., 1.28 GB/s.)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants