Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[RFC] don't sniff the filename to determine the content type #4545

Closed
wants to merge 1 commit into from

Conversation

Stebalien
Copy link
Member

fixes #4543, may break other things

IMO, this is the best way to do this (for now, until we start manually storing the content type along with the files). I'd rather not guess at all but we can at least avoid guessing by filename.

@ghost ghost assigned Stebalien Jan 4, 2018
@ghost ghost added the status/in-progress In progress label Jan 4, 2018
@Stebalien Stebalien requested review from a user and whyrusleeping January 4, 2018 18:12
@Stebalien Stebalien changed the title don't sniff the filename to determine the content type [RFC] don't sniff the filename to determine the content type Jan 4, 2018
@whyrusleeping
Copy link
Member

hrm... is this the best solution? Or should we strip the query parameters from the url before passing it to serveContent?

@whyrusleeping
Copy link
Member

also tests pls

@Stebalien
Copy link
Member Author

Or should we strip the query parameters from the url before passing it to serveContent?

Then we may end up serving, e.g., HTML files as .php files. What happens when we simply don't send a content type?

@whyrusleeping
Copy link
Member

hrm... Right. Because the file in the original issue has a .php extension, but its really an html file that is a snapshot of a server response... Thats annoying

@djdv
Copy link
Contributor

djdv commented Jan 5, 2018

Maybe it would be best to detect and set the Content-Type header for the response before calling ServeContent, since the http package will re-use it if it's set (reference:1 & 2).
In the future I could see something like "if type stored/available, use type, else detect type; then set header and servecontent"
Even though DetectContentType is called inside the http function (when no name is given, otherwise it detects via extension) it may be better to handle type detection outside of that with something purpose made for detecting file formats. The http DetectContentType has a small set of types it can detect compared to a typical "magic file format" package.

However I'm not sure how important it is to cover more than what is already covered by the http package. Is it important for the response type to be accurate for things other than say HTML documents which are already covered by pkg http?

And later, if a type is manually stored, should it be trusted or only used as a fallback if detection cannot determine what type the content is?

@Stebalien
Copy link
Member Author

The ideal solution would be to:

  1. Plumb through MIME types from unixfs to the gateway.
  2. Make ipfs files cp /http/example.com /my/website work (or ipfs files wget http://example.com /a/b/c if we don't want to start reserving protocol directories).

@kevina
Copy link
Contributor

kevina commented Jan 8, 2018

I agree with @djdv that we should detect the content type before hand so we have control over the process. For example, and for some types using the extension may be better then content sniffing (for example distinguishing between a plain text and html documents, a plain text file could start with something that looks like html).

Unless I am missing something MIME types are not really used in unixfs right now. And even if there where how are they set? There has to be some auto-detecting going on somewhere.

@Stebalien
Copy link
Member Author

@kevina go's HTTP server is doing this. Unfortunately, we need to do it somewhere because web browsers expect it.

@kevina
Copy link
Contributor

kevina commented Jan 8, 2018

@Stebalien yes I know that. My point was I agree with @djdv that we should be detecting the content type and set the Content-Type header before calling ServeContent. For now we can just use DetectContentType but in the future we might want to do something more sophisticated.

@Stebalien
Copy link
Member Author

Ah. Sorry, I thought you were asking where we currently do the detection. Yes, I agree. We should be detecting on add (and should allow adding files directly from an http server so we can use the reported content types).

@kevina
Copy link
Contributor

kevina commented Jan 8, 2018

I didn't say we should be detecting on add just that we should set content type before we call ServeContent. Basically instead of

http.ServeContent(w, name, filename, modtime, content)

we do

if (content type unset) {
  // detect file type and set content type, for now just use the files content, 
  // in the future we may also use the filename 
}
http.ServeContent(w, req, "", modtime, content)

@Stebalien
Copy link
Member Author

Got it. I don't really see any reason to do that now if we're not going to actually do anything with it. Manually setting the content type before serving the file by using DetectContentType is equivalent to just letting http do it. We can pull that out once we decide how to actually handle content types.

@eingenito
Copy link
Contributor

eingenito commented Oct 2, 2018

@Stebalien we were just looking at #5369 and that made us wonder what the status of this PR was? Was the decision made to close drop this fix? Should we close this PR?

@Stebalien
Copy link
Member Author

No, I still believe this is the correct solution (for now). I just got distracted.

@djdv, @kevina when we actually add proper MIME-Type support, we can pre-set the content type in serveFile. However, I still don't see why we should manually set the content type if we can't do a better job than http.ServeContent. All we'll end up doing is duplicating this block of code from http which seems like a waste of time and effort.

// read a chunk to decide between utf-8 text and binary
var buf [sniffLen]byte
n, _ := io.ReadFull(content, buf[:])
ctype = DetectContentType(buf[:n])
_, err := content.Seek(0, io.SeekStart) // rewind to output whole file
if err != nil {
    Error(w, "seeker can't seek", StatusInternalServerError)
    return
}
w.Header().Set("Content-Type", ctype)

fixes #4543, may break other things

IMO, this is the best way to do this (for now, until we start manually storing
the content type along with the files). I'd rather not guess at all but we can
at least avoid guessing by filename.

License: MIT
Signed-off-by: Steven Allen <[email protected]>
@djdv
Copy link
Contributor

djdv commented Oct 3, 2018

@Stebalien

I still don't see why we should manually set the content type if we can't do a better job than http.ServeContent

My prior implication is that http.ServeConten{http.DetectContentType(...)} only contains a subset of types, compared to third party packages which are purpose made for detecting types. (like anything utilizing libmagic)

Since the previous post, http.DetectContentType has added (a lot) more types, but I still wonder if it's valuable to maintain control of this in our domain.

You're right that it's the same process, the only difference here would be in the breadth of supported types, seemingly nothing else.
The question is more focused around if supporting more types than Go's http package is important to us or not.
Currently, it looks like it supports most common formats https://golang.org/src/net/http/sniff.go?s=1337:1340

And with that, we likely have the ability to decide on proper icons/thumbnails for anything we'd need, but if we want to support anything outside of that range, we would have to do it ourselves.

For context, mp4's were not detected when I made the previous post, among other common formats.
This has since changed (yay!).

@kevina
Copy link
Contributor

kevina commented Oct 3, 2018

@Stebalien my main concern is that by using only the context type it is easy to get the "text/plain" and "text/html" wrong. It is very easy to construct a text file that looks like HTML but is not really valid HTML.

@@ -387,14 +383,14 @@ func (s *sizeSeeker) Seek(offset int64, whence int) (int64, error) {
return s.sizeReadSeeker.Seek(offset, whence)
}

func (i *gatewayHandler) serveFile(w http.ResponseWriter, req *http.Request, name string, modtime time.Time, content io.ReadSeeker) {
func (i *gatewayHandler) serveFile(w http.ResponseWriter, req *http.Request, modtime time.Time, content io.ReadSeeker) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would keep this parameter but ignore it, to make it easier to change the logic later down the road, just add a comment that the name is ignored for now.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

SGTM.

@Stebalien
Copy link
Member Author

It is very easy to construct a text file that looks like HTML but is not really valid HTML.

Yeah... I agree. Unfortunately, I can't think of any other way to fix this.

However, I've filed a new PR (#5564) to fix the new issue.

Copy link
Contributor

@djdv djdv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

clearing this from the review queue, remark was made here:
#4545 (comment)

@Stebalien
Copy link
Member Author

Closing this as "wait for unixfs-2.0".

@Stebalien Stebalien closed this Oct 24, 2018
@ghost ghost removed the status/in-progress In progress label Oct 24, 2018
@Stebalien Stebalien deleted the fix/4543 branch February 28, 2019 22:46
@Stebalien Stebalien restored the fix/4543 branch May 30, 2019 22:35
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

MIME type sniffing bug (filename prioritized over content?)
5 participants