<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
		>
<channel>
	<title>Comments on: Wikipedia Scraper</title>
	<atom:link href="http://blackhatseo-blog.com/wikipedia-scraper/feed" rel="self" type="application/rss+xml" />
	<link>http://blackhatseo-blog.com/wikipedia-scraper</link>
	<description>spam 2.0</description>
	<lastBuildDate>Wed, 17 Feb 2010 04:16:51 +0100</lastBuildDate>
	<generator>http://wordpress.org/?v=2.8.6</generator>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
		<item>
		<title>By: Tom</title>
		<link>http://blackhatseo-blog.com/wikipedia-scraper/comment-page-1#comment-23511</link>
		<dc:creator>Tom</dc:creator>
		<pubDate>Wed, 01 Oct 2008 04:25:51 +0000</pubDate>
		<guid isPermaLink="false">http://blackhatseo-blog.com/?p=8#comment-23511</guid>
		<description>Nice function. I played with it for a while to get what I needed out of it. My version still isn&#039;t perfect, but it strips out a lot of stuff that I didn&#039;t want.

function wikipedia($article)	{
	$article=explode(&#039; &#039;,$article);
	$article=implode(&#039;_&#039;,$article);
	$article=&quot;http://en.wikipedia.org/wiki/&quot;.urlencode($article);
	$pattern[0] = &#039;/&lt;a href=&quot;(.*?)&quot; rel=&quot;nofollow&quot;&gt;(.*?)/&#039;;
	$replace[0] = &#039;$2&#039;;
	$pattern[1] = &#039;/From Wikipedia, the free encyclopedia/&#039;;
	$replace[1] = &#039;&#039;;
	$pattern[2] = &#039;/(.*?)Jump to: navigation, search/&#039;;
	$replace[2] = &#039;&#039;;
	$pattern[3] = &#039;/(.*?)/&#039;;
	$replace[3] = &#039;&#039;;
	$pattern[4] = &#039;/(.*?)/&#039;;
	$replace[4] = &#039;&#039;;
	$pattern[5] = &#039;/(.*?)/&#039;;
	$replace[5] = &#039;&#039;;
	$pattern[6] = &#039;/(.*?)/&#039;;
	$replace[6] = &#039;$1&#039;;
	$pattern[7] = &#039;/(.*?)/&#039;;
	$replace[7] = &#039;&#039;;
	$pattern[8] = &#039;/(.*?)/&#039;;
	$replace[8] = &#039;&#039;;
	$pattern[9] = &#039;/(.*?)/&#039;;
	$replace[9] = &#039;&#039;;
	$pattern[10] = &#039;/(.*?)/&#039;;
	$replace[10] = &#039;&#039;;
	$pattern[11] = &#039;/(.*?)/&#039;;
	$replace[11] = &#039;&#039;;
	$pattern[12] = &#039;/(.*?)/&#039;;
	$replace[12] = &#039;&#039;;
	$pattern[13] = &#039;/(.*?)/&#039;;
	$replace[13] = &#039;&#039;;
	$pattern[14] = &#039;//&#039;;
	$replace[14] = &#039;&#039;;
	$pattern[15] = &#039;/(.*?)/&#039;;
	$replace[15] = &#039;&#039;;
	$pattern[16] = &#039;/(.*?)/&#039;;
	$replace[16] = &#039;&#039;;
	$pattern[17] = &#039;/(.*?)/&#039;;
	$replace[17] = &#039;&#039;;
	$pattern[18] = &#039;//&#039;;
	$replace[18] = &#039;&#039;;
	$pattern[19] = &#039;/\[(.*?)\]/&#039;;
	$replace[19] = &#039;&#039;;
	$pattern[20] = &#039;/(.*?)/&#039;;
	$replace[20] = &#039;&#039;;
	$pattern[21] = &#039;//&#039;;
	$replace[21] = &#039;&#039;;
	$pattern[22] = &#039;/(.*?)/&#039;;
	$replace[22] = &#039;&#039;;
	$pattern[23] = &#039;/(.*?)/&#039;;
	$replace[23] = &#039;&#039;;
	$pattern[24] = &#039;/(.*?)/&#039;;
	$replace[24] = &#039;&#039;;
	$pattern[25] = &#039;/(.*?)/&#039;;
	$replace[25] = &#039;&#039;;
	$pattern[26] = &#039;/&lt;b&gt;/&#039;;
	$replace[26] = &#039;&lt;strong&gt;&#039;;
	$pattern[27] = &#039;//&#039;;
	$replace[27] = &#039;&lt;/strong&gt;&#039;;
	$pattern[28] = &#039;//&#039;;
	$replace[28] = &#039;&#039;;
	$pattern[29] = &#039;//&#039;;
	$replace[29] = &#039;&#039;;
	$pattern[30] = &#039;/(.*?)/&#039;;
	$replace[30] = &#039;&#039;;
	$pattern[31] = &#039;//&#039;;
	$replace[31] = &#039;&#039;;
	$pattern[32] = &#039;/(.*?)/&#039;;
	$replace[32] = &#039;&#039;;
	$pattern[33] = &#039;/(.*?)/&#039;;
	$replace[33] = &#039;&#039;;
	$wikipedia = file_get_contents($article);
	$wikipedia = preg_replace($pattern, $replace, $wikipedia);
		if (preg_match(&quot;/(.*)/&quot;, $wikipedia, $w)) {
			$wikipedia = $w[1];
		} elseif (preg_match(&quot;/(.*)&lt;a&gt;/is&quot;, $wikipedia, $w)) {
			$wikipedia = $w[1];
		} elseif (preg_match(&quot;/(.*)/is&quot;, $wikipedia, $w)) {
			$wikipedia = $w[1];
		} elseif (preg_match(&quot;/(.*)/is&quot;, $wikipedia, $w)) {
			$wikipedia = $w[1];
		}
	$wikipedia=explode(&#039;name=&quot;References&quot;&#039;,$wikipedia);
	$wikipedia=$wikipedia[0];
	$wikipedia=preg_replace(&#039;@]*?&gt;.*?@si&#039;,&#039;&#039;,$wikipedia);
	$me=$wikipedia;
	$tags  = &#039;applet&#124;bgsound&#124;blink&#124;body&#124;button&#124;form&#124;frame&#124;frameset&#124;head&#124;table&#124;tr&#124;td&#124;div&#124;img&#124;h2&#124;h1&#124;ul&#124;a&#124;h3&#124;h4&#124;span&#124;ol&#124;li&#124;dl&#124;strong&#124;i&#124;b&#124;&#039;;
	$tags .= &#039;html&#124;iframe&#124;ilayer&#124;input&#124;keygen&#124;label&#124;layer&#124;link&#124;object&#124;optgroup&#124;option&#124;marquee&#124;&#039;;
	$tags .= &#039;meta&#124;noframes&#124;nolayer&#124;noscript&#124;param&#124;select&#039;;
	$attribs  = &#039;onclick&#124;ondblclick&#124;onmousedown&#124;onmouseup&#124;onmouseover&#124;ondragdrop&#124;&#039;;
	$attribs .= &#039;onmousemove&#124;onmouseout&#124;onkeypress&#124;onkeydown&#124;onkeyup&#124;onabort&#039;;
	$regex = array(&#039;@]*?&gt;.*?@si&#039;, &#039;@]*?&gt;.*?@si&#039;, &quot;@]*&gt;@i&quot;);
	$me = preg_replace($regex, &#039;&#039;, $me);
	$regex = &quot;@(&#039;\&quot;\s]*)?(.*?/?&gt;)@i&quot;;
	$me = preg_replace($regex, &#039;&#039;, $me);
	$regex = &quot;@(&#039;\&quot;\s]*)?(.*?/?&gt;)@i&quot;;
	while ( preg_match($regex, $me) )
	{
     		$me = preg_replace($regex, &#039;$1$2&#039;, $me);
	}
	$regex = &#039;@(\&#039;&quot;\s]*javascript:[^&gt;\&#039;&quot;\s]*&#124;\&#039;[^\&#039;]*javascript:[^\&#039;]*\&#039;&#124;&quot;[^&quot;]*javascript:[^&quot;]*&quot;))(.*?/?&gt;)@i&#039;;
	while ( preg_match($regex, $me) )
	{
     		$me = preg_replace($regex, &#039;$1$2&#039;, $me);
	}
	return $me;
}


If you connect this to a db, make sure to add a sleep(1); to your loop so Wikipedia doesn&#039;t block you. Call it with:


$myvar=wikipedia(&quot;My Term&quot;);
print $myvar;</description>
		<content:encoded><![CDATA[<p>Nice function. I played with it for a while to get what I needed out of it. My version still isn&#8217;t perfect, but it strips out a lot of stuff that I didn&#8217;t want.</p>
<p>function wikipedia($article)	{<br />
	$article=explode(' ',$article);<br />
	$article=implode('_',$article);<br />
	$article="http://en.wikipedia.org/wiki/".urlencode($article);<br />
	$pattern[0] = '/&lt;a href="(.*?)" rel="nofollow"&gt;(.*?)/';<br />
	$replace[0] = '$2';<br />
	$pattern[1] = '/From Wikipedia, the free encyclopedia/';<br />
	$replace[1] = '';<br />
	$pattern[2] = '/(.*?)Jump to: navigation, search/';<br />
	$replace[2] = '';<br />
	$pattern[3] = '/(.*?)/';<br />
	$replace[3] = '';<br />
	$pattern[4] = '/(.*?)/';<br />
	$replace[4] = '';<br />
	$pattern[5] = '/(.*?)/';<br />
	$replace[5] = '';<br />
	$pattern[6] = '/(.*?)/';<br />
	$replace[6] = '$1';<br />
	$pattern[7] = '/(.*?)/';<br />
	$replace[7] = '';<br />
	$pattern[8] = '/(.*?)/';<br />
	$replace[8] = '';<br />
	$pattern[9] = '/(.*?)/';<br />
	$replace[9] = '';<br />
	$pattern[10] = '/(.*?)/';<br />
	$replace[10] = '';<br />
	$pattern[11] = '/(.*?)/';<br />
	$replace[11] = '';<br />
	$pattern[12] = '/(.*?)/';<br />
	$replace[12] = '';<br />
	$pattern[13] = '/(.*?)/';<br />
	$replace[13] = '';<br />
	$pattern[14] = '//';<br />
	$replace[14] = '';<br />
	$pattern[15] = '/(.*?)/';<br />
	$replace[15] = '';<br />
	$pattern[16] = '/(.*?)/';<br />
	$replace[16] = '';<br />
	$pattern[17] = '/(.*?)/';<br />
	$replace[17] = '';<br />
	$pattern[18] = '//';<br />
	$replace[18] = '';<br />
	$pattern[19] = '/\[(.*?)\]/';<br />
	$replace[19] = '';<br />
	$pattern[20] = '/(.*?)/';<br />
	$replace[20] = '';<br />
	$pattern[21] = '//';<br />
	$replace[21] = '';<br />
	$pattern[22] = '/(.*?)/';<br />
	$replace[22] = '';<br />
	$pattern[23] = '/(.*?)/';<br />
	$replace[23] = '';<br />
	$pattern[24] = '/(.*?)/';<br />
	$replace[24] = '';<br />
	$pattern[25] = '/(.*?)/';<br />
	$replace[25] = '';<br />
	$pattern[26] = '/&lt;b&gt;/';<br />
	$replace[26] = '&lt;strong&gt;';<br />
	$pattern[27] = '//';<br />
	$replace[27] = '&lt;/strong&gt;';<br />
	$pattern[28] = '//';<br />
	$replace[28] = '';<br />
	$pattern[29] = '//';<br />
	$replace[29] = '';<br />
	$pattern[30] = '/(.*?)/';<br />
	$replace[30] = '';<br />
	$pattern[31] = '//';<br />
	$replace[31] = '';<br />
	$pattern[32] = '/(.*?)/';<br />
	$replace[32] = '';<br />
	$pattern[33] = '/(.*?)/';<br />
	$replace[33] = '';<br />
	$wikipedia = file_get_contents($article);<br />
	$wikipedia = preg_replace($pattern, $replace, $wikipedia);<br />
		if (preg_match("/(.*)/", $wikipedia, $w)) {<br />
			$wikipedia = $w[1];<br />
		} elseif (preg_match("/(.*)&lt;a&gt;/is", $wikipedia, $w)) {<br />
			$wikipedia = $w[1];<br />
		} elseif (preg_match("/(.*)/is", $wikipedia, $w)) {<br />
			$wikipedia = $w[1];<br />
		} elseif (preg_match("/(.*)/is", $wikipedia, $w)) {<br />
			$wikipedia = $w[1];<br />
		}<br />
	$wikipedia=explode('name="References"',$wikipedia);<br />
	$wikipedia=$wikipedia[0];<br />
	$wikipedia=preg_replace('@]*?&gt;.*?@si','',$wikipedia);<br />
	$me=$wikipedia;<br />
	$tags  = 'applet|bgsound|blink|body|button|form|frame|frameset|head|table|tr|td|div|img|h2|h1|ul|a|h3|h4|span|ol|li|dl|strong|i|b|';<br />
	$tags .= 'html|iframe|ilayer|input|keygen|label|layer|link|object|optgroup|option|marquee|';<br />
	$tags .= 'meta|noframes|nolayer|noscript|param|select';<br />
	$attribs  = 'onclick|ondblclick|onmousedown|onmouseup|onmouseover|ondragdrop|';<br />
	$attribs .= 'onmousemove|onmouseout|onkeypress|onkeydown|onkeyup|onabort';<br />
	$regex = array('@]*?&gt;.*?@si', '@]*?&gt;.*?@si', "@]*&gt;@i");<br />
	$me = preg_replace($regex, '', $me);<br />
	$regex = "@('\"\s]*)?(.*?/?&gt;)@i";<br />
	$me = preg_replace($regex, '', $me);<br />
	$regex = "@('\"\s]*)?(.*?/?&gt;)@i";<br />
	while ( preg_match($regex, $me) )<br />
	{<br />
		$me = preg_replace($regex, '$1$2', $me);<br />
	}<br />
	$regex = '@(\'"\s]*javascript:[^&gt;\'"\s]*|\'[^\']*javascript:[^\']*\'|"[^"]*javascript:[^"]*"))(.*?/?&gt;)@i';<br />
	while ( preg_match($regex, $me) )<br />
	{<br />
		$me = preg_replace($regex, '$1$2', $me);<br />
	}<br />
	return $me;<br />
}</p>
<p>If you connect this to a db, make sure to add a sleep(1); to your loop so Wikipedia doesn&#8217;t block you. Call it with:</p>
<p>$myvar=wikipedia("My Term");<br />
print $myvar;</p>
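<p>The advice about pausing between requests can be sketched roughly as follows. This is only an illustration: <code>scrapeTerms</code>, its parameters, and the idea of passing the fetch function in as a callable are assumptions, not part of the function posted above.</p>

```php
<?php
// Hypothetical helper (not from the original comment): calls $fetch once per
// term and pauses between requests so the remote site is not hammered.
function scrapeTerms(array $terms, callable $fetch, int $delaySeconds = 1): array
{
    $results = [];
    $remaining = count($terms);
    foreach ($terms as $term) {
        // Store each result under the term that produced it.
        $results[$term] = $fetch($term);
        // Sleep between requests, but not after the last one.
        if (--$remaining > 0 && $delaySeconds > 0) {
            sleep($delaySeconds);
        }
    }
    return $results;
}
```

<p>With the function above defined, a call such as scrapeTerms(array('My Term', 'Another Term'), 'wikipedia') would fetch each term in turn, one second apart. The terms could just as well come from a db query.</p>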
]]></content:encoded>
	</item>
	<item>
		<title>By: Paulo</title>
		<link>http://blackhatseo-blog.com/wikipedia-scraper/comment-page-1#comment-17488</link>
		<dc:creator>Paulo</dc:creator>
		<pubDate>Mon, 28 Jul 2008 02:48:39 +0000</pubDate>
		<guid isPermaLink="false">http://blackhatseo-blog.com/?p=8#comment-17488</guid>
		<description>Some web hosts (such as DreamHost) disable URL access through fopen and file_get_contents on their servers.  To get around that, use cURL instead:

REPLACE:
$wikipedia = fopen($article, &quot;r&quot;);

WITH:
$ch = curl_init();
$timeout = 0; 
curl_setopt ($ch, CURLOPT_URL, $article);
curl_setopt ($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt ($ch, CURLOPT_CONNECTTIMEOUT, $timeout);
$wikipedia = curl_exec($ch);
curl_close($ch);</description>
		<content:encoded><![CDATA[<p>Some web hosts (such as DreamHost) disable URL access through fopen and file_get_contents on their servers.  To get around that, use cURL instead:</p>
<p>REPLACE:<br />
$wikipedia = fopen($article, "r");</p>
<p>WITH:<br />
$ch = curl_init();<br />
$timeout = 0;<br />
curl_setopt ($ch, CURLOPT_URL, $article);<br />
curl_setopt ($ch, CURLOPT_RETURNTRANSFER, 1);<br />
curl_setopt ($ch, CURLOPT_CONNECTTIMEOUT, $timeout);<br />
$wikipedia = curl_exec($ch);<br />
curl_close($ch);</p>
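<p>A slightly more defensive variant of the cURL replacement might look like the sketch below. The function name <code>fetchWithCurl</code>, the timeout values, and the placeholder user agent string are assumptions for illustration. It differs from the snippet above in that it sets finite timeouts and returns null when the request fails or the status is not 200.</p>

```php
<?php
// Hypothetical helper (not from the original comment): fetch a URL with cURL
// and return the body, or null if the request fails or is not a 200 response.
function fetchWithCurl(string $url, int $connectTimeout = 5, int $totalTimeout = 15): ?string
{
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);             // return the body instead of printing it
    curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, $connectTimeout);  // seconds allowed for connecting
    curl_setopt($ch, CURLOPT_TIMEOUT, $totalTimeout);           // seconds allowed for the whole transfer
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);             // follow redirects
    curl_setopt($ch, CURLOPT_USERAGENT, 'ExampleFetcher/0.1');  // placeholder value (assumption)

    $body = curl_exec($ch);
    $status = curl_getinfo($ch, CURLINFO_HTTP_CODE);
    // The handle is released when $ch goes out of scope at the end of the function.

    if ($body === false || $status !== 200) {
        return null;
    }
    return $body;
}
```

<p>Inside the scraper, the line that loads the page would then become $wikipedia = fetchWithCurl($article); followed by a check for null before running the regexes.</p>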
]]></content:encoded>
	</item>
	<item>
		<title>By: handsomemans</title>
		<link>http://blackhatseo-blog.com/wikipedia-scraper/comment-page-1#comment-9363</link>
		<dc:creator>handsomemans</dc:creator>
		<pubDate>Mon, 14 Apr 2008 19:56:14 +0000</pubDate>
		<guid isPermaLink="false">http://blackhatseo-blog.com/?p=8#comment-9363</guid>
		<description>[quote comment=&quot;&quot;]Thank you so much[/quote]</description>
		<content:encoded><![CDATA[<blockquote cite="http://blackhatseo-blog.com/wikipedia-scraper#comment-"><p>
Thank you so much</p>
</blockquote>
]]></content:encoded>
	</item>
	<item>
		<title>By: Dan</title>
		<link>http://blackhatseo-blog.com/wikipedia-scraper/comment-page-1#comment-17</link>
		<dc:creator>Dan</dc:creator>
		<pubDate>Fri, 28 Dec 2007 01:42:20 +0000</pubDate>
		<guid isPermaLink="false">http://blackhatseo-blog.com/?p=8#comment-17</guid>
		<description>This doesn&#039;t really work....</description>
		<content:encoded><![CDATA[<p>This doesn&#8217;t really work&#8230;.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: masterofpuppets</title>
		<link>http://blackhatseo-blog.com/wikipedia-scraper/comment-page-1#comment-19</link>
		<dc:creator>masterofpuppets</dc:creator>
		<pubDate>Sat, 20 Oct 2007 19:28:04 +0000</pubDate>
		<guid isPermaLink="false">http://blackhatseo-blog.com/?p=8#comment-19</guid>
		<description>Is there a standalone Flickr scraper hook available, like the YouTube example here? Thanks for contributing.</description>
		<content:encoded><![CDATA[<p>Is there a standalone Flickr scraper hook available, like the YouTube example here? Thanks for contributing.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: masterofpuppets</title>
		<link>http://blackhatseo-blog.com/wikipedia-scraper/comment-page-1#comment-18</link>
		<dc:creator>masterofpuppets</dc:creator>
		<pubDate>Sat, 20 Oct 2007 19:25:12 +0000</pubDate>
		<guid isPermaLink="false">http://blackhatseo-blog.com/?p=8#comment-18</guid>
		<description>The code gives this error:

Parse error: syntax error, unexpected &#039;}&#039; in wikipedia.php on line 85</description>
		<content:encoded><![CDATA[<p>The code gives this error:</p>
<p>Parse error: syntax error, unexpected &#8216;}&#8217; in wikipedia.php on line 85</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: cashflowrusty</title>
		<link>http://blackhatseo-blog.com/wikipedia-scraper/comment-page-1#comment-16</link>
		<dc:creator>cashflowrusty</dc:creator>
		<pubDate>Mon, 30 Jul 2007 03:46:25 +0000</pubDate>
		<guid isPermaLink="false">http://blackhatseo-blog.com/?p=8#comment-16</guid>
		<description>One note: I am using PHP5 and I get the same &quot;Resource ID #1&quot; error with the script as it currently is. I changed the line:
&lt;code&gt;
$wikipedia = fopen($article, &quot;r&quot;);
&lt;/code&gt;
to:
&lt;code&gt;
$wikipedia = file_get_contents($article);
&lt;/code&gt;
It works, but if you search for an inexact term you could get a 403 error page back as &#039;content&#039;, so there is still an issue doing it this way...</description>
		<content:encoded><![CDATA[<p>One note: I am using PHP5 and I get the same &#8220;Resource ID #1&#8221; error with the script as it currently is. I changed the line:<br />
<code><br />
$wikipedia = fopen($article, "r");<br />
</code><br />
to:<br />
<code><br />
$wikipedia = file_get_contents($article);<br />
</code><br />
It works, but if you search for an inexact term you could get a 403 error page back as &#8216;content&#8217;, so there is still an issue doing it this way&#8230;</p>
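<p>Some background on both points: fopen returns a stream resource rather than the page text, which is why printing it shows a resource id, while file_get_contents returns the contents as a string. For the remaining 403 problem, one possible approach is to check the HTTP status line before treating the response as article content. The sketch below is only an assumption about how that could be done, and <code>lastHttpStatus</code> is a hypothetical helper name.</p>

```php
<?php
// Hypothetical helper (not from the original comment): extract the final HTTP
// status code from response header lines such as "HTTP/1.1 403 Forbidden".
function lastHttpStatus(array $headerLines): ?int
{
    $status = null;
    foreach ($headerLines as $line) {
        // After redirects there can be several status lines; keep the last one.
        if (preg_match('#^HTTP/\d+(?:\.\d+)?\s+(\d{3})#', $line, $m)) {
            $status = (int) $m[1];
        }
    }
    // Null means no status line was found at all.
    return $status;
}
```

<p>After an HTTP request made with file_get_contents, PHP exposes the response header lines in $http_response_header. Passing those lines to lastHttpStatus and skipping the page unless the result is 200 would keep an error page from being stored as content.</p>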
]]></content:encoded>
	</item>
	<item>
		<title>By: killertux</title>
		<link>http://blackhatseo-blog.com/wikipedia-scraper/comment-page-1#comment-15</link>
		<dc:creator>killertux</dc:creator>
		<pubDate>Wed, 11 Apr 2007 23:19:50 +0000</pubDate>
		<guid isPermaLink="false">http://blackhatseo-blog.com/?p=8#comment-15</guid>
		<description>Interesting, but their demo looks a little rough.  I have created one myself at http://www.killertux.com/node/44 .  It features caching, the option to not display images, and a text-only version with no links.  It is available with source code on my site.</description>
		<content:encoded><![CDATA[<p>Interesting, but their demo looks a little rough.  I have created one myself at <a href="http://www.killertux.com/node/44" rel="nofollow">http://www.killertux.com/node/44</a> .  It features caching, the option to not display images, and a text-only version with no links.  It is available with source code on my site.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: vbignacio</title>
		<link>http://blackhatseo-blog.com/wikipedia-scraper/comment-page-1#comment-14</link>
		<dc:creator>vbignacio</dc:creator>
		<pubDate>Tue, 03 Apr 2007 02:41:18 +0000</pubDate>
		<guid isPermaLink="false">http://blackhatseo-blog.com/?p=8#comment-14</guid>
		<description>I tried it and this is what came up:

Resource id #1</description>
		<content:encoded><![CDATA[<p>I tried it and this is what came up:</p>
<p>Resource id #1</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Sinned</title>
		<link>http://blackhatseo-blog.com/wikipedia-scraper/comment-page-1#comment-13</link>
		<dc:creator>Sinned</dc:creator>
		<pubDate>Thu, 29 Mar 2007 09:45:03 +0000</pubDate>
		<guid isPermaLink="false">http://blackhatseo-blog.com/?p=8#comment-13</guid>
		<description>One more question... where do you input the &quot;keyword&quot; or &quot;topic&quot; you want the script to scrape?</description>
		<content:encoded><![CDATA[<p>One more question&#8230; where do you input the &#8220;keyword&#8221; or &#8220;topic&#8221; you want the script to scrape?</p>
]]></content:encoded>
	</item>
</channel>
</rss>
