[RFC] don't sniff the filename to determine the content type #4545

Stebalien · 2018-01-04T18:11:47Z

fixes #4543, may break other things

IMO, this is the best way to do this (for now, until we start manually storing the content type along with the files). I'd rather not guess at all but we can at least avoid guessing by filename.

whyrusleeping · 2018-01-04T18:20:25Z

hrm... is this the best solution? Or should we strip the query parameters from the url before passing it to serveContent?

whyrusleeping · 2018-01-04T18:20:34Z

also tests pls

Stebalien · 2018-01-05T00:31:26Z

> Or should we strip the query parameters from the url before passing it to serveContent?

Then we may end up serving, e.g., HTML files as .php files. What happens when we simply don't send a content type?

whyrusleeping · 2018-01-05T00:35:02Z

hrm... Right. Because the file in the original issue has a .php extension, but it's really an HTML file that is a snapshot of a server response... That's annoying.
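To make the problem concrete: http.ServeContent trusts the extension of the name it is given before it looks at the content. The following is a minimal sketch, not code from this PR. The helper servedContentType is hypothetical, and it uses a .png name instead of .php because the mapping for .php depends on the host's MIME tables.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"strings"
	"time"
)

// servedContentType (hypothetical helper) serves body through
// http.ServeContent under the given name and returns the Content-Type
// header that ends up on the response.
func servedContentType(name, body string) string {
	rec := httptest.NewRecorder()
	req := httptest.NewRequest(http.MethodGet, "/", nil)
	http.ServeContent(rec, req, name, time.Time{}, strings.NewReader(body))
	return rec.Header().Get("Content-Type")
}

func main() {
	html := "<!DOCTYPE html><html><body>hello</body></html>"

	// With a name, the extension wins even though the bytes are HTML.
	fmt.Println(servedContentType("snapshot.png", html))

	// With an empty name, ServeContent falls back to sniffing the content.
	fmt.Println(servedContentType("", html))
}
```

Passing an empty name is the behavior this PR proposes for the gateway.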

djdv · 2018-01-05T01:35:35Z

Maybe it would be best to detect and set the Content-Type header for the response before calling ServeContent, since the http package will reuse it if it's set (reference: 1 & 2).
In the future I could see something like "if a type is stored/available, use it, else detect the type; then set the header and call ServeContent".
Even though DetectContentType is called inside the http function (when no name is given; otherwise it detects via extension), it may be better to handle type detection outside of that, with something purpose-made for detecting file formats. The http DetectContentType can detect only a small set of types compared to a typical "magic file format" package.

However I'm not sure how important it is to cover more than what is already covered by the http package. Is it important for the response type to be accurate for things other than say HTML documents which are already covered by pkg http?

And later, if a type is manually stored, should it be trusted or only used as a fallback if detection cannot determine what type the content is?
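One possible reading of the "stored type, else detect" idea, as a rough sketch only. The stored argument is hypothetical (nothing in unixfs provides it today), and this version simply trusts a stored type when present; the fallback-only policy asked about above would run detection first instead.

```go
package main

import (
	"fmt"
	"net/http"
)

// chooseContentType (hypothetical) prefers a stored MIME type and falls
// back to sniffing the leading bytes of the content when none is stored.
func chooseContentType(stored string, head []byte) string {
	if stored != "" {
		return stored
	}
	return http.DetectContentType(head)
}

func main() {
	head := []byte("<html><body>hi</body></html>")

	// No stored type: the content is sniffed.
	fmt.Println(chooseContentType("", head))

	// A stored type is used as-is, even if sniffing would disagree.
	fmt.Println(chooseContentType("text/plain; charset=utf-8", head))
}
```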

Stebalien · 2018-01-05T02:37:35Z

The ideal solution would be to:

1. Plumb through MIME types from unixfs to the gateway.
2. Make ipfs files cp /http/example.com /my/website work (or ipfs files wget http://example.com /a/b/c if we don't want to start reserving protocol directories).

kevina · 2018-01-08T18:10:40Z

I agree with @djdv that we should detect the content type beforehand so we have control over the process. For some types, using the extension may be better than content sniffing (for example, when distinguishing between plain text and HTML documents: a plain text file could start with something that looks like HTML).

Unless I am missing something, MIME types are not really used in unixfs right now. And even if they were, how would they be set? There has to be some auto-detection going on somewhere.

Stebalien · 2018-01-08T20:34:52Z

@kevina Go's HTTP server is doing this. Unfortunately, we need to do it somewhere because web browsers expect it.

kevina · 2018-01-08T21:02:46Z

@Stebalien yes, I know that. My point was that I agree with @djdv: we should be detecting the content type and setting the Content-Type header before calling ServeContent. For now we can just use DetectContentType, but in the future we might want to do something more sophisticated.

Stebalien · 2018-01-08T21:07:55Z

Ah. Sorry, I thought you were asking where we currently do the detection. Yes, I agree. We should be detecting on add (and should allow adding files directly from an http server so we can use the reported content types).

kevina · 2018-01-08T21:21:30Z

I didn't say we should be detecting on add, just that we should set the content type before we call ServeContent. Basically, instead of

http.ServeContent(w, req, name, modtime, content)

we do

if w.Header().Get("Content-Type") == "" {
	// Detect the file type and set the Content-Type header. For now, use
	// only the file's content; in the future we may also use the filename.
}
http.ServeContent(w, req, "", modtime, content)
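Filled in, the idea above could look roughly like the following. This is a sketch under assumptions, not the gateway's code: serveSniffed and sniffedResponse are hypothetical names, and the 512-byte buffer matches the amount of data http.DetectContentType considers.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
	"strings"
	"time"
)

// serveSniffed (hypothetical) sets Content-Type from the content if the
// caller has not set one, then serves the content without a name so that
// the file extension is never consulted.
func serveSniffed(w http.ResponseWriter, req *http.Request, modtime time.Time, content io.ReadSeeker) {
	if w.Header().Get("Content-Type") == "" {
		var buf [512]byte
		n, _ := io.ReadFull(content, buf[:]) // short reads are fine here
		if _, err := content.Seek(0, io.SeekStart); err != nil {
			http.Error(w, "seeker can't seek", http.StatusInternalServerError)
			return
		}
		w.Header().Set("Content-Type", http.DetectContentType(buf[:n]))
	}
	http.ServeContent(w, req, "", modtime, content)
}

// sniffedResponse (hypothetical) runs serveSniffed against a recorder,
// optionally with a Content-Type set in advance, and returns the
// resulting header and body.
func sniffedResponse(preset, body string) (ctype, served string) {
	rec := httptest.NewRecorder()
	if preset != "" {
		rec.Header().Set("Content-Type", preset)
	}
	req := httptest.NewRequest(http.MethodGet, "/", nil)
	serveSniffed(rec, req, time.Time{}, strings.NewReader(body))
	return rec.Header().Get("Content-Type"), rec.Body.String()
}

func main() {
	ctype, served := sniffedResponse("", "<html><body>hi</body></html>")
	fmt.Println(ctype)
	fmt.Println(served)
}
```

The point of the structure is that a type set earlier (for example, from metadata stored with the file) is left untouched, and sniffing only happens as a fallback.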

Stebalien · 2018-01-09T00:03:30Z

Got it. I don't really see any reason to do that now if we're not going to actually do anything with it. Manually setting the content type before serving the file by using DetectContentType is equivalent to just letting http do it. We can pull that out once we decide how to actually handle content types.

eingenito · 2018-10-02T20:40:38Z

@Stebalien we were just looking at #5369, and that made us wonder what the status of this PR is. Was the decision made to drop this fix? Should we close this PR?

Stebalien · 2018-10-02T23:49:42Z

No, I still believe this is the correct solution (for now). I just got distracted.

@djdv, @kevina when we actually add proper MIME-Type support, we can pre-set the content type in serveFile. However, I still don't see why we should manually set the content type if we can't do a better job than http.ServeContent. All we'll end up doing is duplicating this block of code from http which seems like a waste of time and effort.

// read a chunk to decide between utf-8 text and binary
var buf [sniffLen]byte
n, _ := io.ReadFull(content, buf[:])
ctype = DetectContentType(buf[:n])
_, err := content.Seek(0, io.SeekStart) // rewind to output whole file
if err != nil {
    Error(w, "seeker can't seek", StatusInternalServerError)
    return
}
w.Header().Set("Content-Type", ctype)

djdv · 2018-10-03T16:14:08Z

@Stebalien

> I still don't see why we should manually set the content type if we can't do a better job than http.ServeContent

My earlier point was that http.ServeContent (via http.DetectContentType) only detects a subset of types compared to third-party packages that are purpose-made for detecting types (like anything utilizing libmagic).

Since the previous post, http.DetectContentType has added (a lot) more types, but I still wonder if it's valuable to maintain control of this in our domain.

You're right that it's the same process, the only difference here would be in the breadth of supported types, seemingly nothing else.
The question is more focused around if supporting more types than Go's http package is important to us or not.
Currently, it looks like it supports most common formats https://golang.org/src/net/http/sniff.go?s=1337:1340

And with that, we likely have the ability to decide on proper icons/thumbnails for anything we'd need, but if we want to support anything outside of that range, we would have to do it ourselves.

For context, MP4s were not detected when I made the previous post, among other common formats.
This has since changed (yay!).
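A quick way to check what the standard library recognizes is to feed it a few file signatures. This is only an illustrative sketch: sniffBytes is a hypothetical wrapper, the sample bytes are hand-written headers rather than real files, and the MP4 result depends on the Go release including that signature.

```go
package main

import (
	"fmt"
	"net/http"
)

// sniffBytes (hypothetical wrapper) reports what http.DetectContentType
// makes of the leading bytes of a file.
func sniffBytes(head []byte) string {
	return http.DetectContentType(head)
}

func main() {
	samples := []struct {
		label string
		head  []byte
	}{
		{"png", []byte("\x89PNG\r\n\x1a\n")},
		{"pdf", []byte("%PDF-1.7")},
		// A 24-byte ftyp box with an mp42 major brand.
		{"mp4", []byte("\x00\x00\x00\x18ftypmp42\x00\x00\x00\x00mp42isom")},
		// Bytes that match no signature.
		{"unknown", []byte{0x00, 0x01, 0x02}},
	}
	for _, s := range samples {
		fmt.Printf("%-8s %s\n", s.label, sniffBytes(s.head))
	}
}
```

Anything that matches no signature comes back as application/octet-stream, which is where a libmagic-style package would have more to say.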

kevina · 2018-10-03T19:39:11Z

@Stebalien my main concern is that by using only the content to determine the type, it is easy to get "text/plain" and "text/html" wrong. It is very easy to construct a text file that looks like HTML but is not really valid HTML.
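For illustration, here is a small sketch of that failure mode using Go's sniffer directly. The helper sniffText and both sample strings are made up for this example.

```go
package main

import (
	"fmt"
	"net/http"
)

// sniffText (hypothetical helper) returns the type that
// http.DetectContentType assigns to the given text.
func sniffText(text string) string {
	return http.DetectContentType([]byte(text))
}

func main() {
	// A plain-text note that happens to begin with a tag.
	fmt.Println(sniffText("<b>bold</b> is how you mark bold text; this is just a note.\n"))

	// The same kind of note with the tag later in the text.
	fmt.Println(sniffText("Note: bold text is marked with <b>bold</b>.\n"))
}
```

The first note is reported as text/html and the second as text/plain, even though neither is a valid HTML document.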

kevina · 2018-10-03T19:42:08Z

core/corehttp/gateway_handler.go

@@ -387,14 +383,14 @@ func (s *sizeSeeker) Seek(offset int64, whence int) (int64, error) {
 	return s.sizeReadSeeker.Seek(offset, whence)
 }

-func (i *gatewayHandler) serveFile(w http.ResponseWriter, req *http.Request, name string, modtime time.Time, content io.ReadSeeker) {
+func (i *gatewayHandler) serveFile(w http.ResponseWriter, req *http.Request, modtime time.Time, content io.ReadSeeker) {


I would keep this parameter but ignore it, to make it easier to change the logic later down the road. Just add a comment that the name is ignored for now.
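Something along these lines, as a sketch only. It is written as a free function rather than a gatewayHandler method so that it stands alone, and typeFor is a hypothetical helper for trying it out.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
	"strings"
	"time"
)

// serveFile keeps the name parameter so callers do not have to change
// again later, but ignores it for now: passing "" to http.ServeContent
// makes it sniff the content instead of trusting the file extension.
func serveFile(w http.ResponseWriter, req *http.Request, name string, modtime time.Time, content io.ReadSeeker) {
	_ = name // ignored for now
	http.ServeContent(w, req, "", modtime, content)
}

// typeFor (hypothetical helper) returns the Content-Type that serveFile
// produces for the given name and body.
func typeFor(name, body string) string {
	rec := httptest.NewRecorder()
	req := httptest.NewRequest(http.MethodGet, "/", nil)
	serveFile(rec, req, name, time.Time{}, strings.NewReader(body))
	return rec.Header().Get("Content-Type")
}

func main() {
	fmt.Println(typeFor("snapshot.png", "<html><body>hi</body></html>"))
}
```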

Stebalien · 2018-10-04T22:48:40Z

> It is very easy to construct a text file that looks like HTML but is not really valid HTML.

Yeah... I agree. Unfortunately, I can't think of any other way to fix this.

However, I've filed a new PR (#5564) to fix the new issue.

djdv

clearing this from the review queue, remark was made here:
#4545 (comment)

Stebalien · 2018-10-24T11:31:45Z

Closing this as "wait for unixfs-2.0".

ghost assigned Stebalien Jan 4, 2018

ghost added the status/in-progress In progress label Jan 4, 2018

Stebalien requested review from a user and whyrusleeping January 4, 2018 18:12

Stebalien changed the title ~~don't sniff the filename to determine the content type~~ [RFC] don't sniff the filename to determine the content type Jan 4, 2018

Stebalien mentioned this pull request Jan 4, 2018

MIME type sniffing bug (filename prioritized over content?) #4543

Open

djdv mentioned this pull request Jan 27, 2018

Gateway shows file icon for directory called protocol.ai instead of directory icon #4617

Open

lidel mentioned this pull request Mar 7, 2018

Install Firefox Beta from a link, instead of downloading the .xpi ipfs/ipfs-companion#406

Closed

eingenito mentioned this pull request Oct 2, 2018

IPNS dnslink tld interpreted as file ending, which affects assumed content-type #5369

Closed

Stebalien force-pushed the fix/4543 branch from a63be70 to 60d6b8d Compare October 2, 2018 23:52

Stebalien requested a review from Kubuxu as a code owner October 2, 2018 23:52

Stebalien requested review from kevina and djdv October 3, 2018 00:22

kevina reviewed Oct 3, 2018

djdv reviewed Oct 9, 2018

Stebalien closed this Oct 24, 2018

ghost removed the status/in-progress In progress label Oct 24, 2018

Stebalien deleted the fix/4543 branch February 28, 2019 22:46

Stebalien restored the fix/4543 branch May 30, 2019 22:35
