Gzip decompress proxy for wget (mirror/recursive download)


Wget works pretty well for mirroring sites, especially since version 1.18, which introduced support for parsing img srcset (something httrack unfortunately lacks).

The trouble with wget is that it doesn’t handle gzipped responses (which is IMHO quite surprising in 2017). Wget always asks for uncompressed data by sending the following request header:

Accept-Encoding: identity

… but in real life today, various sites (including big ones like Google) send gzip-encoded content anyway.

Even though the response is correctly marked…

Content-Encoding: gzip

… wget ignores that and stores the responses as they are. That is usually still acceptable, as a simple gunzip fixes it. Unfortunately, in recursive/mirror mode wget cannot parse the still-compressed pages and stylesheets for links, so gzipped responses can prevent it from fetching some content (like images referenced from CSS) or even whole parts of the mirrored site.
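For files that wget has already stored in compressed form, the "simple gunzip" step can be sketched in plain Java. This is only an illustration using the standard java.util.zip classes; the class name GunzipSketch and its helper methods are made up for this example and are not part of wget or LittleProxy.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical helper class, for illustration only.
final class GunzipSketch {
    // A gzip stream always starts with the two magic bytes 0x1f 0x8b.
    static boolean isGzip(byte[] data) {
        return data.length >= 2
                && (data[0] & 0xff) == 0x1f
                && (data[1] & 0xff) == 0x8b;
    }

    // Returns the decompressed bytes, or the input unchanged if it is not gzip.
    static byte[] gunzip(byte[] data) {
        if (!isGzip(data)) {
            return data;
        }
        try (GZIPInputStream in = new GZIPInputStream(new ByteArrayInputStream(data));
             ByteArrayOutputStream out = new ByteArrayOutputStream()) {
            byte[] buffer = new byte[8192];
            int read;
            while ((read = in.read(buffer)) != -1) {
                out.write(buffer, 0, read);
            }
            return out.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Only used here to produce sample input that looks like a stored gzipped response.
    static byte[] gzip(byte[] data) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(out)) {
            gz.write(data);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        byte[] page = "<img srcset=\"a.png 1x, b.png 2x\">".getBytes(StandardCharsets.UTF_8);
        byte[] stored = gzip(page);
        System.out.println(isGzip(stored));
        System.out.println(new String(gunzip(stored), StandardCharsets.UTF_8));
    }
}
```

This only repairs individual files after the fact. It does not recover the links wget never followed because it could not read the compressed markup, which is why a decompressing proxy is still needed for mirroring.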

Squid with some third-party plugin could probably handle that, but I’m not familiar with configuring Squid. Instead, I used LittleProxy to create an HTTP/HTTPS proxy that decompresses compressed responses.

import org.littleshoot.proxy.HttpFiltersSourceAdapter;
import org.littleshoot.proxy.HttpProxyServer;
import org.littleshoot.proxy.extras.SelfSignedMitmManager;
import org.littleshoot.proxy.impl.DefaultHttpProxyServer;

public class Main {
    public static void main(String[] args) {
        HttpProxyServer httpProxyServer = DefaultHttpProxyServer.bootstrap()
                .withPort(8080)
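                // Intercept HTTPS too; responses are re-signed with LittleProxy's self-signed certificate.
                .withManInTheMiddle(new SelfSignedMitmManager())
                .withFiltersSource(new HttpFiltersSourceAdapter() {
                    // A positive buffer size makes LittleProxy aggregate each response
                    // into a full message with its content already decompressed.
                    @Override
                    public int getMaximumResponseBufferSizeInBytes() {
                        //      MB *   kB *    B
                        return 196 * 1024 * 1024;
                    }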
                })
                .start();
    }
}

Because the responses are modified, HTTPS traffic needs to be re-signed with your own certificate. You can either properly set up your own certificate authority and import it everywhere you need it, or use the key provided by LittleProxy and tell wget to ignore untrusted certificates. The complete extra arguments for wget are:

wget ... --execute use_proxy=yes --execute http_proxy=127.0.0.1:8080 --execute https_proxy=127.0.0.1:8080 --no-check-certificate ...

The LittleProxy approach has its drawbacks though. Responses larger than the maximum response buffer (196 MB in my case) will not pass through, and the proxy returns a 502 error instead. For plain HTTP it would be possible to skip decompression based on the request URL, but not for HTTPS. The best solution would be to decide based on the response MIME type (decompressing only text/*), but again, that is unfortunately not possible for HTTPS (at least in the current implementation of LittleProxy).
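To make the idea of a MIME-type rule concrete, here is a minimal sketch of the decision itself. It is deliberately not wired into LittleProxy (as noted, that is the part that does not work for HTTPS); the class DecompressPolicy and the method shouldDecompress are hypothetical names used only for this illustration.

```java
import java.util.Locale;

// Hypothetical helper class, for illustration only; not part of LittleProxy.
final class DecompressPolicy {
    // True when the Content-Type header value denotes a text/* response,
    // i.e. something wget would want to parse for further links.
    static boolean shouldDecompress(String contentType) {
        if (contentType == null) {
            return false;
        }
        // Drop parameters such as "; charset=utf-8" and normalise the case.
        String mime = contentType.split(";", 2)[0].trim().toLowerCase(Locale.ROOT);
        return mime.startsWith("text/");
    }

    public static void main(String[] args) {
        System.out.println(shouldDecompress("text/html; charset=UTF-8"));
        System.out.println(shouldDecompress("image/png"));
    }
}
```

With a rule like this, large binary downloads would bypass the buffer entirely, and only HTML and CSS would be aggregated and decompressed.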

Luckily, this is enough for my projects. You can still use httrack instead if you don’t need srcset support and are OK with query parameters being converted to a hash.