Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

✨ [RUM-162] Truncate resources URL containing data URLs #2690

Merged
merged 12 commits into from
Apr 11, 2024

Conversation

cy-moi
Copy link
Contributor

@cy-moi cy-moi commented Apr 5, 2024

Motivation

We're currently collecting the full url data of resources using data: based resource.url .
This is highly inefficient for the ingestion and indexing side and may contain sensitive informations. This data is always truncated to 24k characters when being indexed, which would result in partial images.
In this case, we want to truncate the partial data till only keep the useful information (MIME type, encodings).

Changes

We truncate data:url to the embedding codec when they exceeds 24k
Examples
Screenshot 2024-04-08 at 18 21 51

Edit (Added truncate indicator [...])
Screenshot 2024-04-10 at 17 21 26

Testing

  • Local
  • Staging
  • Unit
  • End to end

I have gone over the contributing documentation.

Copy link

cit-pr-commenter bot commented Apr 5, 2024

Bundles Sizes Evolution

📦 Bundle Name Base Size Local Size 𝚫 𝚫% Status
Rum 156.48 KiB 156.64 KiB 160 B +0.10%
Logs 55.34 KiB 55.34 KiB 0 B 0.00%
Rum Slim 102.98 KiB 103.18 KiB 208 B +0.20%
Worker 25.21 KiB 25.21 KiB 0 B 0.00%

@codecov-commenter
Copy link

codecov-commenter commented Apr 5, 2024

Codecov Report

Attention: Patch coverage is 92.85714% with 1 lines in your changes are missing coverage. Please review.

Project coverage is 93.36%. Comparing base (61258c1) to head (3f1f4c4).
Report is 10 commits behind head on main.

Files Patch % Lines
...ages/rum-core/src/domain/resource/resourceUtils.ts 90.90% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #2690      +/-   ##
==========================================
- Coverage   93.36%   93.36%   -0.01%     
==========================================
  Files         240      240              
  Lines        6981     6992      +11     
  Branches     1539     1542       +3     
==========================================
+ Hits         6518     6528      +10     
- Misses        463      464       +1     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@cy-moi
Copy link
Contributor Author

cy-moi commented Apr 5, 2024

/to-staging

@dd-devflow
Copy link
Contributor

dd-devflow bot commented Apr 5, 2024

🚂 Branch Integration: starting soon, merge in < 10m

Commit 477629a771 will soon be integrated into staging-14.

This build is going to start soon! (estimated merge in less than 10m)

Use /to-staging -c to cancel this operation!

dd-mergequeue bot added a commit that referenced this pull request Apr 5, 2024
@dd-devflow
Copy link
Contributor

dd-devflow bot commented Apr 5, 2024

🚂 Branch Integration: This commit was successfully integrated

Commit 477629a771 has been merged into staging-14 in merge commit 8ca824d45b.

Check out the triggered pipeline on Gitlab 🦊

@dd-devflow dd-devflow bot added the staging-14 label Apr 5, 2024

return attributeValue
// Truncate data:url to avoid performance impact
return findDataUrlAndTruncate(attributeValue) ?? attributeValue
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

❓ question: ‏You are truncating all data:url not only the one above a MAX_ATTRIBUTE_VALUE_CHAR_LENGTH. Is it intended? If yes, could it have an impact on Replay?

Copy link
Contributor Author

@cy-moi cy-moi Apr 5, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, it was intended. But then indeed the truncation would prevent some allowed data to show up properly. So fixing on this right now.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've added the length check back to the recorder as well as for the resource url.

packages/rum-core/src/domain/resource/resourceUtils.ts Outdated Show resolved Hide resolved
packages/rum-core/src/domain/resource/resourceUtils.ts Outdated Show resolved Hide resolved
@cy-moi cy-moi requested a review from amortemousque April 8, 2024 14:34
@cy-moi
Copy link
Contributor Author

cy-moi commented Apr 8, 2024

/to-staging

@dd-devflow
Copy link
Contributor

dd-devflow bot commented Apr 8, 2024

🚂 Branch Integration: starting soon, merge in < 0s

Commit 9b3336f9e3 will soon be integrated into staging-15.

This build is going to start soon! (estimated merge in less than 0s)

Use /to-staging -c to cancel this operation!

@dd-devflow
Copy link
Contributor

dd-devflow bot commented Apr 8, 2024

🚨 Branch Integration: The build pipeline contains failing jobs for this merge request

We couldn't automatically merge the commit 9b3336f9e3 into staging-15.
Build pipeline has failing jobs for 151b347

Since those jobs are not marked as being allowed to fail, the pipeline will most likely fail.
Therefore, and to allow other builds to be processed, this merge request has been rejected before the end of the pipeline.

You should have a look at the pipeline, wait for the build to finish and investigate the failures.
When you are ready, re-add your pull request to the queue!

⚠️ Do NOT retry failed jobs directly.

They were executed on a temporary branch created by the merge queue. If you retry them, they are going to fail because the branch no longer exists.

If you think the errors come from a logical conflict with the target branch, you can create a fix by commenting this pull request with /create-fix-branch -b staging-15

Details

List of failed jobs:

Go to failed Gitlab pipeline.

If you need support, contact us on Slack #devflow with those details!

@cy-moi
Copy link
Contributor Author

cy-moi commented Apr 8, 2024

/create-fix-branch -b staging-15

@dd-devflow
Copy link
Contributor

dd-devflow bot commented Apr 8, 2024

🚂 Devflow: /create-fix-branch -b staging-15

Created fix branch fix-merge-9b3336f9e3-into-staging-15 - #2698

@dd-devflow
Copy link
Contributor

dd-devflow bot commented Apr 8, 2024

🚂 Branch Integration: starting soon, merge in < 0s

Commit 9b3336f9e3 will soon be integrated into staging-15.

This build is going to start soon! (estimated merge in less than 0s)

@cy-moi cy-moi changed the title [RUM-162] Truncate resources URL containing data URLs ✨ [RUM-162] Truncate resources URL containing data URLs Apr 8, 2024
@dd-devflow dd-devflow bot added the staging-15 label Apr 8, 2024
@dd-devflow
Copy link
Contributor

dd-devflow bot commented Apr 8, 2024

🚂 Branch Integration

Commit 9b3336f9e3 has been merged into staging-15 in merge commit 2f8877703d.

Check out the triggered pipeline on Gitlab 🦊

@cy-moi cy-moi marked this pull request as ready for review April 8, 2024 16:23
@cy-moi cy-moi requested a review from a team as a code owner April 8, 2024 16:23
const DATA_URL_REGEX = /data:(.+)?(;base64)?,/g
export const MAX_ATTRIBUTE_VALUE_CHAR_LENGTH = 24_000

export function isDataUrlTooLong(url: string): boolean {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💬 suggestion: ‏ this function name suggests it's taking a data URL as an input and check if it's too long, while what it is actually doing is taking a string as en input check if it's a data URL and if it's long.

I propose to either:

  • rename to something like isLongDataUrl (I think the too is not adding much value)
  • split the function in two such as isAboveLimit and isDataUrl

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! I searched the codebase and it seems like we are only using this limit length for data URL, hence making it into separate functions feels redundant. But the naming is indeed not accurate. I have renamed it to isLongDataUrl.

@cy-moi cy-moi requested a review from thomas-lebeau April 9, 2024 10:54
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't we also truncate the url coming from performance resource entries?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Notes on our slack discussion, we found that here it says:

If an HTML IMG element has a data: URI as its source [RFC2397], then this resource will not be included as a PerformanceResourceTiming object in the Performance Timeline. By definition data: URI contains embedded data and does not require a fetch.

So we do not sanitize data url in this case.

}

export function sanitizeDataUrl(url: string): string {
return url.match(DATA_URL_REGEX)![0]
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💬 suggestion:
It could be nice to have something indicating that it the url has been truncated like data:[<mediatype>] [...] (the same way we do for action names)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've added the [...] as in action names to indicate the truncation.

@cy-moi cy-moi merged commit 2045899 into main Apr 11, 2024
17 checks passed
@cy-moi cy-moi deleted the congyao/RUM-162-truncate-data-url-for-resources branch April 11, 2024 14:33
@cy-moi cy-moi restored the congyao/RUM-162-truncate-data-url-for-resources branch April 11, 2024 16:58
@cy-moi cy-moi deleted the congyao/RUM-162-truncate-data-url-for-resources branch April 12, 2024 12:53
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants