<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>fatkun&#039;s blog &#187; Garbled Chinese text</title>
	<atom:link href="http://fatkun.com/tag/%e4%b8%ad%e6%96%87%e4%b9%b1%e7%a0%81/feed" rel="self" type="application/rss+xml" />
	<link>http://fatkun.com</link>
	<description>Just another WordPress site</description>
	<lastBuildDate>Sun, 05 Feb 2012 15:21:33 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.3.1</generator>
		<item>
		<title>GAE: URL Fetch on Google App Engine (java.net.URLConnection)</title>
		<link>http://fatkun.com/2010/01/get-website-source-using-google-app-engine.html</link>
		<comments>http://fatkun.com/2010/01/get-website-source-using-google-app-engine.html#comments</comments>
		<pubDate>Sat, 23 Jan 2010 08:17:12 +0000</pubDate>
		<dc:creator>fatkun</dc:creator>
				<category><![CDATA[J2EE]]></category>
		<category><![CDATA[GAE]]></category>
		<category><![CDATA[google app engine]]></category>
		<category><![CDATA[Garbled Chinese text]]></category>
		<category><![CDATA[URL Fetch]]></category>

		<guid isPermaLink="false">http://fatkun.com/?p=254</guid>
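		<description><![CDATA[URL Fetch on Google App Engine is quite convenient: you can use the java.net.URLConnection class. What can we do with it? For example, fetch weather information from some site, and so on. (Just a reminder: the above is an image, so don't click it by mistake.) See the example: http://2.latest.fatkuns.appspot.com/ What is GAE URL Fetch? App Engine applications can fetch resources and communicate with other hosts over the Internet using HTTP and HTTPS requests. Applications use the URL Fetch service to make these requests. As I understand it, this simply means you can use it to fetch the source of other people's web pages. Fetching page source with URL package com.fatkun; /** * Fetch a URL on GAE * @author Fatkun * @site http://fatkun.com */ &#160; import java.io.IOException; import java.io.InputStreamReader; import java.net.URL; &#160; import javax.servlet.http.*; &#160; @SuppressWarnings&#40;&#34;serial&#34;&#41; public class URL2Servlet extends HttpServlet &#123; public void doGet&#40;HttpServletRequest req, HttpServletResponse resp&#41; throws IOException &#123; [...]]]></description>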
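			<content:encoded><![CDATA[<p>URL Fetch on Google App Engine is quite convenient: you can use the java.net.URLConnection class. What can we do with it? For example, fetch weather information from some site, and so on.<br />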
<img src="http://farm3.static.flickr.com/2724/4297317270_500f2c34b3.jpg" alt="" /><br />
(Just a reminder: the above is an image, so don't click it by mistake.)<br />
See the example: <a href="http://2.latest.fatkuns.appspot.com/">http://2.latest.fatkuns.appspot.com/</a><br />
<span id="more-254"></span></p>
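<h2>What is GAE URL Fetch?</h2>
<blockquote><p>App Engine applications can fetch resources and communicate with other hosts over the Internet using HTTP and HTTPS requests. Applications use the URL Fetch service to make these requests.</p></blockquote>
<p>As I understand it, this simply means you can use it to fetch the source of other people's web pages.</p>
<h2>Fetching page source with URL</h2>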

<div class="wp_syntax"><div class="code"><pre class="java" style="font-family:monospace;"><span style="color: #000000; font-weight: bold;">package</span> <span style="color: #006699;">com.fatkun</span><span style="color: #339933;">;</span>
<span style="color: #008000; font-style: italic; font-weight: bold;">/**
 * Fetch a URL on GAE
 * @author Fatkun
 * @site http://fatkun.com
 */</span>
&nbsp;
<span style="color: #000000; font-weight: bold;">import</span> <span style="color: #006699;">java.io.IOException</span><span style="color: #339933;">;</span>
<span style="color: #000000; font-weight: bold;">import</span> <span style="color: #006699;">java.io.InputStreamReader</span><span style="color: #339933;">;</span>
<span style="color: #000000; font-weight: bold;">import</span> <span style="color: #006699;">java.net.URL</span><span style="color: #339933;">;</span>
&nbsp;
<span style="color: #000000; font-weight: bold;">import</span> <span style="color: #006699;">javax.servlet.http.*</span><span style="color: #339933;">;</span>
&nbsp;
@SuppressWarnings<span style="color: #009900;">&#40;</span><span style="color: #0000ff;">&quot;serial&quot;</span><span style="color: #009900;">&#41;</span>
<span style="color: #000000; font-weight: bold;">public</span> <span style="color: #000000; font-weight: bold;">class</span> URL2Servlet <span style="color: #000000; font-weight: bold;">extends</span> HttpServlet <span style="color: #009900;">&#123;</span>
	<span style="color: #000000; font-weight: bold;">public</span> <span style="color: #000066; font-weight: bold;">void</span> doGet<span style="color: #009900;">&#40;</span>HttpServletRequest req, HttpServletResponse resp<span style="color: #009900;">&#41;</span> <span style="color: #000000; font-weight: bold;">throws</span> <span style="color: #003399;">IOException</span> <span style="color: #009900;">&#123;</span>
		resp.<span style="color: #006633;">setContentType</span><span style="color: #009900;">&#40;</span><span style="color: #0000ff;">&quot;text/plain; charset=utf-8&quot;</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span><span style="color: #666666; font-style: italic;">// encoding of the response sent to the browser</span>
&nbsp;
		<span style="color: #003399;">URL</span> url <span style="color: #339933;">=</span> <span style="color: #000000; font-weight: bold;">new</span> <span style="color: #003399;">URL</span><span style="color: #009900;">&#40;</span><span style="color: #0000ff;">&quot;http://fatkun.com&quot;</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
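		<span style="color: #666666; font-style: italic;">// Read the page source</span>
		<span style="color: #666666; font-style: italic;">// A Reader decodes the bytes into characters using the given charset, so multi-byte Chinese characters are not split or garbled</span>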
		<span style="color: #003399;">InputStreamReader</span> in <span style="color: #339933;">=</span> <span style="color: #000000; font-weight: bold;">new</span> <span style="color: #003399;">InputStreamReader</span><span style="color: #009900;">&#40;</span>url.<span style="color: #006633;">openStream</span><span style="color: #009900;">&#40;</span><span style="color: #009900;">&#41;</span>, <span style="color: #0000ff;">&quot;UTF-8&quot;</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
		<span style="color: #000066; font-weight: bold;">char</span><span style="color: #009900;">&#91;</span><span style="color: #009900;">&#93;</span> buf <span style="color: #339933;">=</span> <span style="color: #000000; font-weight: bold;">new</span> <span style="color: #000066; font-weight: bold;">char</span><span style="color: #009900;">&#91;</span><span style="color: #cc66cc;">2048</span><span style="color: #009900;">&#93;</span><span style="color: #339933;">;</span><span style="color: #666666; font-style: italic;">// buffer</span>
		<span style="color: #003399;">StringBuffer</span> sb <span style="color: #339933;">=</span> <span style="color: #000000; font-weight: bold;">new</span> <span style="color: #003399;">StringBuffer</span><span style="color: #009900;">&#40;</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
		<span style="color: #000066; font-weight: bold;">int</span> len <span style="color: #339933;">=</span> <span style="color: #cc66cc;">0</span><span style="color: #339933;">;</span>
		<span style="color: #000000; font-weight: bold;">while</span> <span style="color: #009900;">&#40;</span><span style="color: #009900;">&#40;</span>len <span style="color: #339933;">=</span> in.<span style="color: #006633;">read</span><span style="color: #009900;">&#40;</span>buf<span style="color: #009900;">&#41;</span><span style="color: #009900;">&#41;</span> <span style="color: #339933;">!=</span> <span style="color: #339933;">-</span><span style="color: #cc66cc;">1</span><span style="color: #009900;">&#41;</span> <span style="color: #009900;">&#123;</span><span style="color: #666666; font-style: italic;">// keep reading until the end of the stream</span>
			sb.<span style="color: #006633;">append</span><span style="color: #009900;">&#40;</span>buf, <span style="color: #cc66cc;">0</span>, len<span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
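		<span style="color: #009900;">&#125;</span>
		in.<span style="color: #006633;">close</span><span style="color: #009900;">&#40;</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span><span style="color: #666666; font-style: italic;">// release the connection once the page has been read</span>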
&nbsp;
		<span style="color: #666666; font-style: italic;">// Write it to the response page</span>
		resp.<span style="color: #006633;">getWriter</span><span style="color: #009900;">&#40;</span><span style="color: #009900;">&#41;</span>.<span style="color: #006633;">println</span><span style="color: #009900;">&#40;</span>sb.<span style="color: #006633;">toString</span><span style="color: #009900;">&#40;</span><span style="color: #009900;">&#41;</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
	<span style="color: #009900;">&#125;</span>
<span style="color: #009900;">&#125;</span></pre></div></div>
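
<p>The servlet above hard-codes UTF-8 as the encoding of the fetched page. As a rough sketch (not part of the original example), the charset could instead be taken from the response's Content-Type header, which <code>URLConnection.getContentType()</code> returns. The class name <code>CharsetGuess</code> and the method <code>fromContentType</code> are made up for illustration, and the sketch assumes the server actually declares its charset in that header.</p>

<div class="wp_syntax"><div class="code"><pre class="java" style="font-family:monospace;">import java.nio.charset.Charset;
import java.util.Locale;

public class CharsetGuess {
    /**
     * Returns the charset named in a Content-Type value such as
     * "text/html; charset=GBK", or the fallback when none is usable.
     * (Hypothetical helper, for illustration only.)
     */
    public static Charset fromContentType(String contentType, Charset fallback) {
        if (contentType == null) {
            return fallback; // the server sent no Content-Type header
        }
        for (String part : contentType.split(";")) {
            String p = part.trim();
            if (p.toLowerCase(Locale.ROOT).startsWith("charset=")) {
                // Strip optional quotes around the charset name
                String name = p.substring("charset=".length()).trim().replace("\"", "");
                try {
                    return Charset.forName(name);
                } catch (IllegalArgumentException e) {
                    return fallback; // malformed or unsupported charset name
                }
            }
        }
        return fallback; // header present but without a charset parameter
    }

    public static void main(String[] args) {
        Charset utf8 = Charset.forName("UTF-8");
        System.out.println(fromContentType("text/html; charset=ISO-8859-1", utf8));
        System.out.println(fromContentType("text/html", utf8));
    }
}</pre></div></div>

<p>The result can be passed straight to <code>new InputStreamReader(connection.getInputStream(), charset)</code>, since InputStreamReader also accepts a Charset object.</p>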

<h2>POSTing content with HttpURLConnection</h2>

<div class="wp_syntax"><div class="code"><pre class="java" style="font-family:monospace;"><span style="color: #666666; font-style: italic;">// Replace this address with your own; for local testing you can use http://localhost:8888/request.jsp</span>
<span style="color: #003399;">URL</span> url <span style="color: #339933;">=</span> <span style="color: #000000; font-weight: bold;">new</span> <span style="color: #003399;">URL</span><span style="color: #009900;">&#40;</span><span style="color: #0000ff;">&quot;http://2.latest.fatkuns.appspot.com/request.jsp&quot;</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
<span style="color: #003399;">HttpURLConnection</span> connection <span style="color: #339933;">=</span> <span style="color: #009900;">&#40;</span><span style="color: #003399;">HttpURLConnection</span><span style="color: #009900;">&#41;</span> url.<span style="color: #006633;">openConnection</span><span style="color: #009900;">&#40;</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
connection.<span style="color: #006633;">setDoOutput</span><span style="color: #009900;">&#40;</span><span style="color: #000066; font-weight: bold;">true</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span><span style="color: #666666; font-style: italic;">// enable output (a request body) on this URL connection</span>
connection.<span style="color: #006633;">setRequestMethod</span><span style="color: #009900;">&#40;</span><span style="color: #0000ff;">&quot;POST&quot;</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
<span style="color: #666666; font-style: italic;">// Get the output stream</span>
<span style="color: #003399;">OutputStreamWriter</span> writer <span style="color: #339933;">=</span> <span style="color: #000000; font-weight: bold;">new</span> <span style="color: #003399;">OutputStreamWriter</span><span style="color: #009900;">&#40;</span>connection.<span style="color: #006633;">getOutputStream</span><span style="color: #009900;">&#40;</span><span style="color: #009900;">&#41;</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
<span style="color: #666666; font-style: italic;">// URL-encode as UTF-8 so that the Chinese text is transmitted correctly</span>
<span style="color: #003399;">String</span> message <span style="color: #339933;">=</span> <span style="color: #003399;">URLEncoder</span>.<span style="color: #006633;">encode</span><span style="color: #009900;">&#40;</span><span style="color: #0000ff;">&quot;你好，I'm Fatkun!&quot;</span>, <span style="color: #0000ff;">&quot;UTF-8&quot;</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
<span style="color: #666666; font-style: italic;">// Write the content to send</span>
writer.<span style="color: #006633;">write</span><span style="color: #009900;">&#40;</span><span style="color: #0000ff;">&quot;msg=&quot;</span> <span style="color: #339933;">+</span> message<span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
writer.<span style="color: #006633;">close</span><span style="color: #009900;">&#40;</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span></pre></div></div>

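<p>The above is the main part of the code; the comments should make it clear. Note that this fragment only writes the request body. To complete the request you typically still need to read the response, for example via connection.getResponseCode() or connection.getInputStream().</p>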
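
<p>For reference, here is one way the fragment could be filled out into a complete round trip, as a sketch rather than the original servlet. The class name <code>PostExample</code> and the methods <code>buildBody</code> and <code>post</code> are invented for illustration, and decoding the response as UTF-8 is an assumption about the target page.</p>

<div class="wp_syntax"><div class="code"><pre class="java" style="font-family:monospace;">import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.Reader;
import java.io.Writer;
import java.net.HttpURLConnection;
import java.net.URL;
import java.net.URLEncoder;

public class PostExample {
    /** Builds the form body "msg=..." with the value URL-encoded as UTF-8. */
    public static String buildBody(String message) throws IOException {
        return "msg=" + URLEncoder.encode(message, "UTF-8");
    }

    /** POSTs the body and returns the response text (assumed to be UTF-8). */
    public static String post(URL url, String body) throws IOException {
        HttpURLConnection connection = (HttpURLConnection) url.openConnection();
        connection.setDoOutput(true);
        connection.setRequestMethod("POST");
        connection.setRequestProperty("Content-Type", "application/x-www-form-urlencoded");

        // Write the request body; closing the writer flushes it
        try (Writer writer = new OutputStreamWriter(connection.getOutputStream(), "UTF-8")) {
            writer.write(body);
        }

        // Reading the response is what completes the request
        try (Reader in = new InputStreamReader(connection.getInputStream(), "UTF-8")) {
            StringBuilder sb = new StringBuilder();
            char[] buf = new char[2048];
            int len;
            while ((len = in.read(buf)) != -1) {
                sb.append(buf, 0, len);
            }
            return sb.toString();
        }
    }

    public static void main(String[] args) throws IOException {
        // Only builds and prints the body; call post(...) with your own URL to send it
        System.out.println(buildBody("你好，I'm Fatkun!"));
    }
}</pre></div></div>

<p>Keeping <code>buildBody</code> separate from <code>post</code> makes the encoding step easy to check without any network access.</p>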
<h2>Garbled Chinese text on Google App Engine</h2>
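<p>Note that a web page containing Chinese may be encoded as UTF-8, GBK, GB2312, and so on, so reading it with a raw InputStream is inconvenient and can also produce errors.<br />
I tried using InputStream and then converting the bytes with new String(bytes, "utf-8"), but ran into some problems; I'm not sure whether I was simply using it wrong.<br />
Writing it the following way is much more convenient:<br />
InputStreamReader in = new InputStreamReader(url.openStream(), "UTF-8");<br />
No conversion is needed; just specify the encoding.<br />
Note that you must pass "UTF-8" here. Without it, local testing works fine, but after uploading to GAE the Chinese no longer displays. This is presumably because InputStreamReader falls back to the platform default charset when none is given, and that default can differ between the two environments.<br />
PS: the UTF-8 here stands for the encoding of the page you are fetching. If the page you fetch is in gb2312, change it to match.</p>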
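
<p>One likely cause of the problem with new String(bytes, "utf-8") is that a multi-byte character can be cut in half at the end of a byte buffer. The following self-contained sketch (the class <code>ChunkDecodeDemo</code> and its method names are made up for illustration) decodes the same bytes both ways, using a deliberately tiny 4-byte buffer and an in-memory stream instead of a real page.</p>

<div class="wp_syntax"><div class="code"><pre class="java" style="font-family:monospace;">import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;

public class ChunkDecodeDemo {
    /** Decodes every byte chunk on its own: characters that straddle two chunks break. */
    public static String decodePerChunk(InputStream in, int chunkSize) throws IOException {
        StringBuilder sb = new StringBuilder();
        byte[] buf = new byte[chunkSize];
        int len;
        while ((len = in.read(buf)) != -1) {
            sb.append(new String(buf, 0, len, "UTF-8"));
        }
        return sb.toString();
    }

    /** Lets InputStreamReader decode: it carries incomplete characters over to the next read. */
    public static String decodeWithReader(InputStream in, int chunkSize) throws IOException {
        StringBuilder sb = new StringBuilder();
        Reader reader = new InputStreamReader(in, "UTF-8");
        char[] buf = new char[chunkSize];
        int len;
        while ((len = reader.read(buf)) != -1) {
            sb.append(buf, 0, len);
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        // Each of these characters takes 3 bytes in UTF-8, so 4-byte chunks split them
        byte[] page = "你好，世界".getBytes("UTF-8");
        System.out.println(decodePerChunk(new ByteArrayInputStream(page), 4));
        System.out.println(decodeWithReader(new ByteArrayInputStream(page), 4));
    }
}</pre></div></div>

<p>The per-chunk version yields replacement characters wherever a character was split, while the Reader version returns the original text. Collecting all bytes first and decoding once at the end would also avoid the problem.</p>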
<p>Here is the example I made: <a href="http://2.latest.fatkuns.appspot.com/">http://2.latest.fatkuns.appspot.com/</a><br />
The source is here: <a href="http://2.latest.fatkuns.appspot.com/source.rar">http://2.latest.fatkuns.appspot.com/source.rar</a>. I removed the contents of the lib directory, so please add them yourself.</p>
]]></content:encoded>
			<wfw:commentRss>http://fatkun.com/2010/01/get-website-source-using-google-app-engine.html/feed</wfw:commentRss>
		<slash:comments>14</slash:comments>
		</item>
	</channel>
</rss>

