<?xml version="1.0" encoding="UTF-8"?> <rss
version="2.0"
xmlns:content="http://purl.org/rss/1.0/modules/content/"
xmlns:wfw="http://wellformedweb.org/CommentAPI/"
xmlns:dc="http://purl.org/dc/elements/1.1/"
xmlns:atom="http://www.w3.org/2005/Atom"
xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
><channel><title>fatkun&#039;s blog &#187; Garbled Chinese text</title> <atom:link href="http://fatkun.com/tag/%e4%b8%ad%e6%96%87%e4%b9%b1%e7%a0%81/feed" rel="self" type="application/rss+xml" /><link>http://fatkun.com</link> <description>Just another WordPress site</description> <lastBuildDate>Sun, 05 Sep 2010 14:16:36 +0000</lastBuildDate> <language>en</language> <sy:updatePeriod>hourly</sy:updatePeriod> <sy:updateFrequency>1</sy:updateFrequency> <generator>http://wordpress.org/?v=3.0.1</generator> <item><title>GAE: URL Fetch on Google App Engine (java.net.URLConnection)</title><link>http://fatkun.com/2010/01/get-website-source-using-google-app-engine.html</link> <comments>http://fatkun.com/2010/01/get-website-source-using-google-app-engine.html#comments</comments> <pubDate>Sat, 23 Jan 2010 08:17:12 +0000</pubDate> <dc:creator>fatkun</dc:creator> <category><![CDATA[J2EE]]></category> <category><![CDATA[GAE]]></category> <category><![CDATA[google app engine]]></category> <category><![CDATA[garbled Chinese text]]></category> <category><![CDATA[URL fetch]]></category><guid
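isPermaLink="false">http://fatkun.com/?p=254</guid> <description><![CDATA[URL fetching on Google App Engine is quite convenient: you can use the java.net.URLConnection class. What can you do with it? For example, you could pull weather information from some source. (Note: the item above is just an image, so don't click it by mistake.) See the example: http://2.latest.fatkuns.appspot.com/ What is GAE URL Fetch? App Engine applications can fetch resources and communicate with other hosts over the Internet using HTTP and HTTPS requests. Applications use the URL Fetch service to make these requests. As I see it, this simply means you can use it to fetch the source of other people's web pages. Fetching page source with URL package com.fatkun; /** * Fetch a URL on GAE * @author Fatkun * @site http://fatkun.com */ &#160; import java.io.IOException; import java.io.InputStreamReader; import java.net.URL; &#160; import javax.servlet.http.*; &#160; @SuppressWarnings&#40;&#34;serial&#34;&#41; public class URL2Servlet extends HttpServlet &#123; public void doGet&#40;HttpServletRequest req, HttpServletResponse resp&#41; throws IOException &#123; [...]]]></description> <content:encoded><![CDATA[<p>URL fetching on Google App Engine is quite convenient: you can use the java.net.URLConnection class. What can you do with it? For example, you could pull weather information from some source.<br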
/> <img
src="http://farm3.static.flickr.com/2724/4297317270_500f2c34b3.jpg" alt="" /><br
/> (Note: the item above is just an image, so don't click it by mistake.)<br
/> See the example: <a
href="http://2.latest.fatkuns.appspot.com/">http://2.latest.fatkuns.appspot.com/</a><br
/> <span
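id="more-254"></span></p><h2>What is GAE URL Fetch?</h2><blockquote><p>App Engine applications can fetch resources and communicate with other hosts over the Internet using HTTP and HTTPS requests. Applications use the URL Fetch service to make these requests.</p></blockquote><p>As I see it, this simply means you can use it to fetch the source of other people's web pages.</p><h2>Fetching page source with URL</h2><div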
class="wp_syntax"><div
class="code"><pre class="java" style="font-family:monospace;"><span style="color: #7F0055; font-weight: bold;">package</span> <span style="color: #006699;">com.fatkun</span><span style="color: #339933;">;</span>
<span style="color: #3F7F5F; font-style: normal; ">/**
 * Fetch a URL on GAE
 * @author Fatkun
 * @site http://fatkun.com
 */</span>
&nbsp;
<span style="color: #7F0055; font-weight: bold;">import</span> <span style="color: #006699;">java.io.IOException</span><span style="color: #339933;">;</span>
<span style="color: #7F0055; font-weight: bold;">import</span> <span style="color: #006699;">java.io.InputStreamReader</span><span style="color: #339933;">;</span>
<span style="color: #7F0055; font-weight: bold;">import</span> <span style="color: #006699;">java.net.URL</span><span style="color: #339933;">;</span>
&nbsp;
<span style="color: #7F0055; font-weight: bold;">import</span> <span style="color: #006699;">javax.servlet.http.*</span><span style="color: #339933;">;</span>
&nbsp;
@SuppressWarnings<span style="color: #009900;">&#40;</span><span style="color: #0000ff;">&quot;serial&quot;</span><span style="color: #009900;">&#41;</span>
<span style="color: #7F0055; font-weight: bold;">public</span> <span style="color: #7F0055; font-weight: bold;">class</span> URL2Servlet <span style="color: #7F0055; font-weight: bold;">extends</span> HttpServlet <span style="color: #009900;">&#123;</span>
	<span style="color: #7F0055; font-weight: bold;">public</span> <span style="color: #7F0055; font-weight: bold;">void</span> doGet<span style="color: #009900;">&#40;</span>HttpServletRequest req, HttpServletResponse resp<span style="color: #009900;">&#41;</span> <span style="color: #7F0055; font-weight: bold;">throws</span> <span style="color: #003399;">IOException</span> <span style="color: #009900;">&#123;</span>
		resp.<span style="color: #000000;">setContentType</span><span style="color: #009900;">&#40;</span><span style="color: #0000ff;">&quot;text/plain; charset=utf-8&quot;</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span><span style="color: #3F7F5F; font-style: normal;">// encoding of the response</span>
&nbsp;
		<span style="color: #003399;">URL</span> url <span style="color: #339933;">=</span> <span style="color: #7F0055; font-weight: bold;">new</span> <span style="color: #003399;">URL</span><span style="color: #009900;">&#40;</span><span style="color: #0000ff;">&quot;http://fatkun.com&quot;</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
		<span style="color: #3F7F5F; font-style: normal;">// Read the page source</span>
		<span style="color: #3F7F5F; font-style: normal;">// A Reader decodes bytes into characters using the given charset, so multi-byte Chinese characters are not garbled</span>
		<span style="color: #003399;">InputStreamReader</span> in <span style="color: #339933;">=</span> <span style="color: #7F0055; font-weight: bold;">new</span> <span style="color: #003399;">InputStreamReader</span><span style="color: #009900;">&#40;</span>url.<span style="color: #000000;">openStream</span><span style="color: #009900;">&#40;</span><span style="color: #009900;">&#41;</span>, <span style="color: #0000ff;">&quot;UTF-8&quot;</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
		<span style="color: #7F0055; font-weight: bold;">char</span><span style="color: #009900;">&#91;</span><span style="color: #009900;">&#93;</span> buf <span style="color: #339933;">=</span> <span style="color: #7F0055; font-weight: bold;">new</span> <span style="color: #7F0055; font-weight: bold;">char</span><span style="color: #009900;">&#91;</span><span style="color: #cc66cc;">2048</span><span style="color: #009900;">&#93;</span><span style="color: #339933;">;</span><span style="color: #3F7F5F; font-style: normal;">// buffer</span>
		<span style="color: #003399;">StringBuffer</span> sb <span style="color: #339933;">=</span> <span style="color: #7F0055; font-weight: bold;">new</span> <span style="color: #003399;">StringBuffer</span><span style="color: #009900;">&#40;</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
		<span style="color: #7F0055; font-weight: bold;">int</span> len <span style="color: #339933;">=</span> <span style="color: #cc66cc;">0</span><span style="color: #339933;">;</span>
		<span style="color: #7F0055; font-weight: bold;">while</span> <span style="color: #009900;">&#40;</span><span style="color: #009900;">&#40;</span>len <span style="color: #339933;">=</span> in.<span style="color: #000000;">read</span><span style="color: #009900;">&#40;</span>buf<span style="color: #009900;">&#41;</span><span style="color: #009900;">&#41;</span> <span style="color: #339933;">!=</span> <span style="color: #339933;">-</span><span style="color: #cc66cc;">1</span><span style="color: #009900;">&#41;</span> <span style="color: #009900;">&#123;</span><span style="color: #3F7F5F; font-style: normal;">// keep reading until the end of the stream</span>
			sb.<span style="color: #000000;">append</span><span style="color: #009900;">&#40;</span>buf, <span style="color: #cc66cc;">0</span>, len<span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
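		<span style="color: #009900;">&#125;</span>
		in.<span style="color: #000000;">close</span><span style="color: #009900;">&#40;</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span><span style="color: #3F7F5F; font-style: normal;">// release the underlying stream</span>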
&nbsp;
		<span style="color: #3F7F5F; font-style: normal;">// Write the source to the response</span>
		resp.<span style="color: #000000;">getWriter</span><span style="color: #009900;">&#40;</span><span style="color: #009900;">&#41;</span>.<span style="color: #000000;">println</span><span style="color: #009900;">&#40;</span>sb.<span style="color: #000000;">toString</span><span style="color: #009900;">&#40;</span><span style="color: #009900;">&#41;</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
	<span style="color: #009900;">&#125;</span>
<span style="color: #009900;">&#125;</span></pre></div></div><h2>POSTing content with HttpURLConnection</h2><div
class="wp_syntax"><div
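class="code"><pre class="java" style="font-family:monospace;"><span style="color: #3F7F5F; font-style: normal;">// Needs java.io.OutputStreamWriter, java.net.HttpURLConnection, java.net.URL and java.net.URLEncoder</span>
<span style="color: #3F7F5F; font-style: normal;">// Replace this address with your own; for local testing you can use http://localhost:8888/request.jsp</span>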
<span style="color: #003399;">URL</span> url <span style="color: #339933;">=</span> <span style="color: #7F0055; font-weight: bold;">new</span> <span style="color: #003399;">URL</span><span style="color: #009900;">&#40;</span><span style="color: #0000ff;">&quot;http://2.latest.fatkuns.appspot.com/request.jsp&quot;</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
<span style="color: #003399;">HttpURLConnection</span> connection <span style="color: #339933;">=</span> <span style="color: #009900;">&#40;</span><span style="color: #003399;">HttpURLConnection</span><span style="color: #009900;">&#41;</span> url.<span style="color: #000000;">openConnection</span><span style="color: #009900;">&#40;</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
connection.<span style="color: #000000;">setDoOutput</span><span style="color: #009900;">&#40;</span><span style="color: #7F0055; font-weight: bold;">true</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span><span style="color: #3F7F5F; font-style: normal;">// allow this URL connection to send a request body</span>
connection.<span style="color: #000000;">setRequestMethod</span><span style="color: #009900;">&#40;</span><span style="color: #0000ff;">&quot;POST&quot;</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
<span style="color: #3F7F5F; font-style: normal;">// Get the output stream</span>
<span style="color: #003399;">OutputStreamWriter</span> writer <span style="color: #339933;">=</span> <span style="color: #7F0055; font-weight: bold;">new</span> <span style="color: #003399;">OutputStreamWriter</span><span style="color: #009900;">&#40;</span>connection.<span style="color: #000000;">getOutputStream</span><span style="color: #009900;">&#40;</span><span style="color: #009900;">&#41;</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
<span style="color: #3F7F5F; font-style: normal;">// URL-encode as UTF-8 so the Chinese text is transmitted correctly</span>
<span style="color: #003399;">String</span> message <span style="color: #339933;">=</span> <span style="color: #003399;">URLEncoder</span>.<span style="color: #000000;">encode</span><span style="color: #009900;">&#40;</span><span style="color: #0000ff;">&quot;你好，I'm Fatkun!&quot;</span>, <span style="color: #0000ff;">&quot;UTF-8&quot;</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
<span style="color: #3F7F5F; font-style: normal;">// Write the content to send</span>
writer.<span style="color: #000000;">write</span><span style="color: #009900;">&#40;</span><span style="color: #0000ff;">&quot;msg=&quot;</span> <span style="color: #339933;">+</span> message<span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
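writer.<span style="color: #000000;">close</span><span style="color: #009900;">&#40;</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
<span style="color: #3F7F5F; font-style: normal;">// Reading the response (here, the status code) ensures the request is actually sent</span>
<span style="color: #7F0055; font-weight: bold;">int</span> status <span style="color: #339933;">=</span> connection.<span style="color: #000000;">getResponseCode</span><span style="color: #009900;">&#40;</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span></pre></div></div><p>That is the core code; the comments explain each step.</p><h2>Garbled Chinese text on Google App Engine</h2><p>Note that Chinese web pages may be encoded as UTF-8, GBK, GB2312 and so on, so reading them with the raw InputStream class is inconvenient and error-prone.<br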
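/> I tried reading with the InputStream class and then converting with new String(bytes, "utf-8"), but ran into some problems; I am not sure whether I was simply using it wrong.<br
/> Writing it this way is much more convenient:<br
/> InputStreamReader in = new InputStreamReader(url.openStream(), "UTF-8");<br
/> No manual conversion is needed; just specify the encoding.<br
/> Be sure to pass "UTF-8" here. Without it, InputStreamReader falls back to the platform default charset: that worked in local testing, but once deployed to GAE the Chinese text no longer displayed correctly.<br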
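/></p><p>The post does not diagnose why the new String(bytes, "utf-8") attempt misbehaved. One plausible cause, offered here only as an assumption, is that each buffer was decoded on its own, so a multi-byte character cut in half by a buffer boundary was corrupted. The sketch below contrasts that with InputStreamReader, which carries an incomplete byte sequence over to the next read. The class and method names (CharsetDecodeDemo, decodePerChunk, decodeWithReader) are hypothetical, and the tiny 4-byte buffer is chosen only to force the split.</p><div class="wp_syntax"><div class="code"><pre class="java" style="font-family:monospace;">import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;

// Hypothetical demo class, not part of the original servlet.
class CharsetDecodeDemo {
	// Decodes every chunk separately: a character split across two reads is corrupted.
	static String decodePerChunk(InputStream in, int chunkSize) throws IOException {
		byte[] buf = new byte[chunkSize];
		StringBuilder sb = new StringBuilder();
		int len;
		while ((len = in.read(buf)) != -1) {
			sb.append(new String(buf, 0, len, "UTF-8"));
		}
		return sb.toString();
	}

	// Lets InputStreamReader decode: partial characters wait for the remaining bytes.
	static String decodeWithReader(InputStream in, int chunkSize) throws IOException {
		Reader reader = new InputStreamReader(in, "UTF-8");
		char[] buf = new char[chunkSize];
		StringBuilder sb = new StringBuilder();
		int len;
		while ((len = reader.read(buf)) != -1) {
			sb.append(buf, 0, len);
		}
		return sb.toString();
	}

	public static void main(String[] args) throws IOException {
		// Two Chinese characters written as escapes; each takes 3 bytes in UTF-8.
		String text = "\u4f60\u597d";
		byte[] page = text.getBytes("UTF-8");
		System.out.println(decodePerChunk(new ByteArrayInputStream(page), 4).equals(text));
		System.out.println(decodeWithReader(new ByteArrayInputStream(page), 4).equals(text));
	}
}</pre></div></div><p>With the 4-byte buffer the first check prints false and the second prints true. With a large buffer the per-chunk version would fail only occasionally, which makes this kind of bug easy to miss.<br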
/> PS: The UTF-8 here is the encoding of the page being fetched. If the page you fetch is GB2312, change it accordingly.</p><p>Here is the example I built: <a
href="http://2.latest.fatkuns.appspot.com/">http://2.latest.fatkuns.appspot.com/</a><br
/> The source code is here: <a
href="http://2.latest.fatkuns.appspot.com/source.rar">http://2.latest.fatkuns.appspot.com/source.rar</a>. I removed the contents of the lib directory, so please add the libraries yourself.</p> ]]></content:encoded> <wfw:commentRss>http://fatkun.com/2010/01/get-website-source-using-google-app-engine.html/feed</wfw:commentRss> <slash:comments>11</slash:comments> </item> </channel> </rss>