[wget-notify] [bug #20371] Generalize --html-extension -> --mime-extensions

Micah Cowan INVALID.NOREPLY at gnu.org
Mon Mar 31 22:32:48 PDT 2008


Follow-up Comment #1, bug #20371 (project wget):

(The following is mainly copy-pasted from my response to a GSoC proposal.)

As a directly applicable example, someone might wish to download all HTML
content from a website. This is easy, as you can just specify "--accept
.htm,.html" to download all files that end with those extensions.

...except that quite a lot of HTML content is made available via URLs that
_don't_ end in those suffixes. For instance, an "index.php" file might be
HTML. But then, "index.php?show=logo" might resolve to a JPEG image or
some such.
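
To make the problem concrete, here is a minimal Python sketch of
suffix-based acceptance. It is a simplified stand-in, not Wget's actual
matching code; the function name accept_by_suffix and the example.com URLs
are made up for illustration.

```python
from urllib.parse import urlsplit

def accept_by_suffix(url, suffixes=(".htm", ".html")):
    """Return True if the URL's path ends with one of the accepted suffixes.

    Hypothetical helper: only the path is inspected, never the content.
    """
    path = urlsplit(url).path.lower()
    return path.endswith(tuple(suffixes))

# The two index.php URLs are rejected the same way, even though one of them
# might serve HTML and the other a JPEG: the name alone cannot tell them apart.
for url in ("http://example.com/about.html",
            "http://example.com/index.php",
            "http://example.com/index.php?show=logo"):
    print(url, accept_by_suffix(url))
```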

Using "filename suffixes" is a poor way to try to determine what sort of
thing a given URL refers to. HTTP has built-in ways of identifying what kind
of resource is being served, and we should be using _those_ instead of
filenames.

For instance, .htm and .html files and the HTML index.php file will all
identify their Content-Type as "text/html". On the other hand, the
"index.php?show=logo" that resolves to a JPEG would be identified as
"image/jpeg". It would be very useful to have Wget support accept/reject based
on the information in the Content-Type header, rather than filenames. Wget
would issue an initial HEAD request to verify that the type is acceptable,
before issuing a full-blown GET request to fetch the resource.
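
A rough Python sketch of that HEAD-then-GET decision follows. It only
illustrates the idea and says nothing about how Wget would implement it; the
function names and the default accepted set are assumptions.

```python
import urllib.request

def media_type(content_type):
    """Strip parameters such as charset and normalise case."""
    return content_type.split(";", 1)[0].strip().lower()

def type_is_accepted(content_type, accepted):
    """Compare the bare media type against a set of accepted types."""
    return media_type(content_type) in accepted

def head_content_type(url, timeout=10):
    """Issue a HEAD request; return the raw Content-Type header or None."""
    req = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return resp.headers.get("Content-Type")

def should_fetch(url, accepted=frozenset({"text/html"})):
    """Hypothetical policy: fetch with GET only if HEAD reports an accepted type."""
    ctype = head_content_type(url)
    return ctype is not None and type_is_accepted(ctype, accepted)
```

Splitting the header parsing from the network call keeps the accept/reject
rule testable on its own, for example
type_is_accepted("text/html; charset=UTF-8", {"text/html"}).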

The advantage is that Content-Type offers a much more precise way to accept
or reject resources based on what kind of resource they are. The disadvantage
is that, unless type information is embedded in the links themselves (which
HTML supports, but few authors take advantage of), it requires extra requests
to be sent to the server to determine what the Content-Type is. Also, not all
servers support HEAD for all resources. Name-based matches have the advantage
that the decision is made without involving the network.
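
As an illustration of the "type information embedded in the links" case,
the Python sketch below collects the advisory "type" attribute from <a> and
<link> elements, so that only links without a hint would need an extra HEAD
request. The class and function names are hypothetical, and a type hint in
markup is only the author's claim, not a guarantee of what the server sends.

```python
from html.parser import HTMLParser

class LinkTypeCollector(HTMLParser):
    """Collect (href, type hint or None) pairs from <a> and <link> tags."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag not in ("a", "link"):
            return
        attr = dict(attrs)
        href = attr.get("href")
        if href:
            self.links.append((href, attr.get("type")))

def needs_head_request(links):
    """Return the hrefs that carry no type hint and so need a network check."""
    return [href for href, hint in links if hint is None]

page = ('<a href="index.php?show=logo" type="image/jpeg">logo</a> '
        '<a href="index.php">home</a>')
collector = LinkTypeCollector()
collector.feed(page)
print(needs_head_request(collector.links))
```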

    _______________________________________________________

Reply to this item at:

  <http://savannah.gnu.org/bugs/?20371>

_______________________________________________
  Message sent via/by Savannah
  http://savannah.gnu.org/


