Avoiding Repeated Downloads with wget
- Permalink: http://zhubaining.com/blog/2010/09/18/archives/wget%e9%81%bf%e5%85%8d%e9%87%8d%e5%a4%8d%e4%b8%8b%e8%bd%bd
- Author: zhubaining
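Someone asked how to keep wget from downloading a file again, that is, how to skip the download when the file has not changed.
After a quick look, it turns out that wget's -N option does exactly this: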
-N, --timestamping don't re-retrieve files unless newer than local.
While I am at it, the "-d" option is worth a mention too: it prints debug information, including the request and response headers.
-d, --debug print lots of debugging information.
An example:
Let's use wget to download Google's robots file: http://www.google.com/robots.txt
At first the file does not exist:
zhubaining@zhubaining-laptop:~/tmp$ ls robots.txt
ls: cannot access robots.txt: No such file or directory
Fetch the file with wget:
zhubaining@zhubaining-laptop:~/tmp$ wget http://www.google.com/robots.txt
--2010-09-18 23:14:42-- http://www.google.com/robots.txt
Resolving www.google.com... 66.249.89.104
Connecting to www.google.com|66.249.89.104|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified
Saving to: `robots.txt'
[ <=> ] 5,173 --.-K/s in 0.006s
2010-09-18 23:14:42 (801 KB/s) - `robots.txt' saved [5173]
Now run wget again, this time with the -N and -d options:
zhubaining@zhubaining-laptop:~/tmp$ wget -N -d http://www.google.com/robots.txt
DEBUG output created by Wget 1.12 on linux-gnu.
--2010-09-18 23:15:35-- http://www.google.com/robots.txt
Resolving www.google.com... 66.249.89.104
Caching www.google.com => 66.249.89.104
Connecting to www.google.com|66.249.89.104|:80... connected.
Created socket 3.
Releasing 0x09373ed0 (new refcount 1).
---request begin---
HEAD /robots.txt HTTP/1.0
User-Agent: Wget/1.12 (linux-gnu)
Accept: */*
Host: www.google.com
Connection: Keep-Alive
---request end---
HTTP request sent, awaiting response...
---response begin---
HTTP/1.0 200 OK
Content-Length: 5173
Content-Type: text/plain
Last-Modified: Mon, 09 Aug 2010 17:36:07 GMT
Date: Sat, 18 Sep 2010 15:15:37 GMT
Expires: Sat, 18 Sep 2010 15:15:37 GMT
Cache-Control: private, max-age=0
Vary: Accept-Encoding
X-Content-Type-Options: nosniff
Server: sffe
X-XSS-Protection: 1; mode=block
Connection: Keep-Alive
---response end---
200 OK
Registered socket 3 for persistent reuse.
Length: 5173 (5.1K)
Server file no newer than local file `robots.txt' -- not retrieving.
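As you can see, wget uses the HTTP HEAD method to get the file's modification time and other metadata, and then decides whether to fetch the file again.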
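To make it concrete which fields of the HEAD response matter here, the following is a minimal sketch that pulls a single header value out of a block of response headers. The function name header_value and the variable sample_headers are made up for this illustration, and the sample headers are copied from the response above. It is only a way to inspect the headers by hand, not how wget parses them internally.

```shell
# header_value NAME
# Reads HTTP response headers on stdin and prints the value of the first
# header named NAME (matched case-insensitively).
# Hypothetical helper for illustration only.
header_value() {
    awk -v name="$1" '
        BEGIN { name = tolower(name) }
        {
            sub(/\r$/, "")                  # drop the CR of a CRLF ending
            sep = index($0, ":")
            if (sep == 0) next              # status line or blank line
            if (tolower(substr($0, 1, sep - 1)) == name) {
                value = substr($0, sep + 1)
                sub(/^[ \t]+/, "", value)   # trim leading whitespace
                print value
                exit
            }
        }'
}

# Headers copied from the HEAD response shown above.
sample_headers='HTTP/1.0 200 OK
Content-Length: 5173
Content-Type: text/plain
Last-Modified: Mon, 09 Aug 2010 17:36:07 GMT'

printf '%s\n' "$sample_headers" | header_value Last-Modified
printf '%s\n' "$sample_headers" | header_value content-length
```

The two calls print the Last-Modified date and the Content-Length, which appear to be the only two values the -N decision depends on in the sessions shown in this post.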
Now empty the file and run wget again:
zhubaining@zhubaining-laptop:~/tmp$ > robots.txt
zhubaining@zhubaining-laptop:~/tmp$ wget -N -d http://www.google.com/robots.txt
DEBUG output created by Wget 1.12 on linux-gnu.
--2010-09-18 23:25:40-- http://www.google.com/robots.txt
Resolving www.google.com... 66.249.89.104
Caching www.google.com => 66.249.89.104
Connecting to www.google.com|66.249.89.104|:80... connected.
Created socket 3.
Releasing 0x0946aed0 (new refcount 1).
---request begin---
HEAD /robots.txt HTTP/1.0
User-Agent: Wget/1.12 (linux-gnu)
Accept: */*
Host: www.google.com
Connection: Keep-Alive
---request end---
HTTP request sent, awaiting response...
---response begin---
HTTP/1.0 200 OK
Content-Length: 5173
Content-Type: text/plain
Last-Modified: Mon, 09 Aug 2010 17:36:07 GMT
Date: Sat, 18 Sep 2010 15:25:42 GMT
Expires: Sat, 18 Sep 2010 15:25:42 GMT
Cache-Control: private, max-age=0
Vary: Accept-Encoding
X-Content-Type-Options: nosniff
Server: sffe
X-XSS-Protection: 1; mode=block
Connection: Keep-Alive
---response end---
200 OK
Registered socket 3 for persistent reuse.
Length: 5173 (5.1K)
The sizes do not match (local 0) -- retrieving.
--2010-09-18 23:25:41-- http://www.google.com/robots.txt
Reusing existing connection to www.google.com:80.
Reusing fd 3.
---request begin---
GET /robots.txt HTTP/1.0
User-Agent: Wget/1.12 (linux-gnu)
Accept: */*
Host: www.google.com
Connection: Keep-Alive
---request end---
HTTP request sent, awaiting response...
---response begin---
HTTP/1.0 200 OK
Content-Type: text/plain
Last-Modified: Mon, 09 Aug 2010 17:36:07 GMT
Date: Sat, 18 Sep 2010 15:25:42 GMT
Expires: Sat, 18 Sep 2010 15:25:42 GMT
Cache-Control: private, max-age=0
Vary: Accept-Encoding
X-Content-Type-Options: nosniff
Server: sffe
X-XSS-Protection: 1; mode=block
---response end---
200 OK
Length: unspecified
Saving to: `robots.txt'
[ <=> ] 5,173 --.-K/s in 0.007s
Disabling further reuse of socket 3.
Closed fd 3
2010-09-18 23:25:41 (769 KB/s) - `robots.txt' saved [5173]
Here wget first uses HEAD to get the file's metadata, finds that the sizes differ, and then sends a second request (a GET) to fetch the file.
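The two outcomes seen so far ("not retrieving" and "The sizes do not match") suggest a simple decision rule. The sketch below is a rough model of that observed behavior, not wget's actual code. The function names should_retrieve and describe are invented for this example, and the timestamps 1000 and 2000 are arbitrary stand-ins for "time of Last-Modified" and "some later time".

```shell
# should_retrieve LOCAL_MTIME LOCAL_SIZE REMOTE_MTIME REMOTE_SIZE
# Times are seconds since the epoch, sizes are bytes. Pass an empty
# REMOTE_SIZE when the server sent no Content-Length.
# Returns 0 (true) when the file should be fetched again.
# Hypothetical model of the behavior seen in the logs above.
should_retrieve() {
    local_mtime=$1 local_size=$2 remote_mtime=$3 remote_size=$4
    if [ "$remote_mtime" -gt "$local_mtime" ]; then
        return 0    # the server copy is newer
    fi
    if [ -n "$remote_size" ] && [ "$remote_size" -ne "$local_size" ]; then
        return 0    # "The sizes do not match"
    fi
    return 1        # "Server file no newer than local file"
}

describe() {
    if should_retrieve "$@"; then echo retrieve; else echo skip; fi
}

describe 1000 5173 1000 5173   # untouched local copy    -> skip
describe 2000 0    1000 5173   # local copy emptied      -> retrieve
describe 2000 5173 1000 5173   # later mtime, same size  -> skip
```

The third call is the interesting one: a local file with a later modification time and an unchanged size is skipped under this rule.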
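I then noticed that if the local file is modified but its size stays the same, wget -N does not fetch it again, because it considers the local file newer than the server's copy (editing the file changes its modification time):
This is the file as downloaded by wget: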
zhubaining@zhubaining-laptop:~/Documents/tmp$ ll robots.txt
-rw-r--r-- 1 zhubaining zhubaining 5173 2010-08-10 01:36 robots.txt
Now modify the file without changing its size:
zhubaining@zhubaining-laptop:~/Documents/tmp$ sed -i 's/:/!/' robots.txt
zhubaining@zhubaining-laptop:~/Documents/tmp$ ll robots.txt
-rw-r--r-- 1 zhubaining zhubaining 5173 2010-09-19 13:14 robots.txt
Running wget -N again shows that the file is not fetched:
zhubaining@zhubaining-laptop:~/Documents/tmp$ wget -N -d http://www.google.com/robots.txt
DEBUG output created by Wget 1.12 on linux-gnu.
--2010-09-19 13:17:20-- http://www.google.com/robots.txt
Resolving www.google.com... 66.249.89.104
Caching www.google.com => 66.249.89.104
Connecting to www.google.com|66.249.89.104|:80... connected.
Created socket 3.
Releasing 0x08d7eed0 (new refcount 1).
---request begin---
HEAD /robots.txt HTTP/1.0
User-Agent: Wget/1.12 (linux-gnu)
Accept: */*
Host: www.google.com
Connection: Keep-Alive
---request end---
HTTP request sent, awaiting response...
---response begin---
HTTP/1.0 200 OK
Content-Length: 5173
Content-Type: text/plain
Last-Modified: Mon, 09 Aug 2010 17:36:07 GMT
Date: Sun, 19 Sep 2010 05:17:22 GMT
Expires: Sun, 19 Sep 2010 05:17:22 GMT
Cache-Control: private, max-age=0
Vary: Accept-Encoding
X-Content-Type-Options: nosniff
Server: sffe
X-XSS-Protection: 1; mode=block
Connection: Keep-Alive
---response end---
200 OK
Registered socket 3 for persistent reuse.
Length: 5173 (5.1K)
Server file no newer than local file `robots.txt' -- not retrieving.
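I googled this problem again and again and could not find another wget option that solves it. For now, at least one workaround is possible: if the local file's modification time is newer than Last-Modified, fetch the file directly without -N, on the assumption that a changed modification time usually means the file was edited. This workaround does the wrong thing when the modification time changed but the content did not: the file is downloaded again even though nothing differs.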
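The workaround could be sketched roughly as follows. This is only one possible way to do the comparison, under a few assumptions: the Last-Modified value is passed in as a string in the format shown in the logs above, the shell's `[` supports -nt (bash and dash do), and the helper names http_date_to_touch and fetch_mode are invented for this example. The sketch only reports which mode the workaround would pick and does not run wget itself.

```shell
# http_date_to_touch "Mon, 09 Aug 2010 17:36:07 GMT"
# Prints the same instant in the CCYYMMDDhhmm.SS form accepted by `touch -t`.
# Hypothetical helper for illustration only.
http_date_to_touch() {
    printf '%s\n' "$1" | awk '{
        months = "JanFebMarAprMayJunJulAugSepOctNovDec"
        month = (index(months, $3) + 2) / 3
        split($5, hms, ":")
        printf "%04d%02d%02d%02d%02d.%02d\n", $4, month, $2, hms[1], hms[2], hms[3]
    }'
}

# fetch_mode FILE HTTP_DATE
# Prints "plain" when FILE exists and was modified after HTTP_DATE,
# otherwise "timestamping" (that is, keep using wget -N).
fetch_mode() {
    mode=timestamping
    if [ -e "$1" ]; then
        ref=$(mktemp) || return 1
        # HTTP dates are in GMT, so set the reference file's mtime in UTC.
        TZ=UTC0 touch -t "$(http_date_to_touch "$2")" "$ref"
        if [ "$1" -nt "$ref" ]; then
            mode=plain
        fi
        rm -f "$ref"
    fi
    echo "$mode"
}

# Demo: a file created just now is newer than the 2010 Last-Modified value.
demo=$(mktemp)
fetch_mode "$demo" "Mon, 09 Aug 2010 17:36:07 GMT"
rm -f "$demo"
```

The demo prints "plain". One thing to watch when acting on that result: a plain wget leaves an existing robots.txt alone and saves the new copy as robots.txt.1, so the local file probably has to be removed or renamed first.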
p.s.: Experience has shown time and again that mature commands cover most everyday needs, so when you need something, go straight to man xx or xx --help.