<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	>
<channel>
	<title>Comments on: Wikipedia Scraper</title>
	<atom:link href="http://blackhatseo-blog.com/wikipedia-scraper/feed" rel="self" type="application/rss+xml" />
	<link>http://blackhatseo-blog.com/wikipedia-scraper</link>
	<description>spam 2.0</description>
	<pubDate>Thu, 04 Dec 2008 23:28:23 +0000</pubDate>
	<generator>http://wordpress.org/?v=2.6.5</generator>
		<item>
		<title>By: Tom</title>
		<link>http://blackhatseo-blog.com/wikipedia-scraper#comment-23511</link>
		<dc:creator>Tom</dc:creator>
		<pubDate>Wed, 01 Oct 2008 04:25:51 +0000</pubDate>
		<guid isPermaLink="false">http://blackhatseo-blog.com/?p=8#comment-23511</guid>
		<description>Nice function, I played with it for a while to get what I needed out of it.  My version still isn't perfect, but it strips out a lot of stuff that I didn't want.

function wikipedia($article)	{
	$article=explode(' ',$article);
	$article=implode('_',$article);
	$article="http://en.wikipedia.org/wiki/".urlencode($article);
	$pattern[0] = '/&lt;a href="(.*?)" rel="nofollow"&gt;(.*?)/';
	$replace[0] = '$2';
	$pattern[1] = '/From Wikipedia, the free encyclopedia/';
	$replace[1] = '';
	$pattern[2] = '/(.*?)Jump to: navigation, search/';
	$replace[2] = '';
	$pattern[3] = '/(.*?)/';
	$replace[3] = '';
	$pattern[4] = '/(.*?)/';
	$replace[4] = '';
	$pattern[5] = '/(.*?)/';
	$replace[5] = '';
	$pattern[6] = '/(.*?)/';
	$replace[6] = '$1';
	$pattern[7] = '/(.*?)/';
	$replace[7] = '';
	$pattern[8] = '/(.*?)/';
	$replace[8] = '';
	$pattern[9] = '/(.*?)/';
	$replace[9] = '';
	$pattern[10] = '/(.*?)/';
	$replace[10] = '';
	$pattern[11] = '/(.*?)/';
	$replace[11] = '';
	$pattern[12] = '/(.*?)/';
	$replace[12] = '';
	$pattern[13] = '/(.*?)/';
	$replace[13] = '';
	$pattern[14] = '//';
	$replace[14] = '';
	$pattern[15] = '/(.*?)/';
	$replace[15] = '';
	$pattern[16] = '/(.*?)/';
	$replace[16] = '';
	$pattern[17] = '/(.*?)/';
	$replace[17] = '';
	$pattern[18] = '//';
	$replace[18] = '';
	$pattern[19] = '/\[(.*?)\]/';
	$replace[19] = '';
	$pattern[20] = '/(.*?)/';
	$replace[20] = '';
	$pattern[21] = '//';
	$replace[21] = '';
	$pattern[22] = '/(.*?)/';
	$replace[22] = '';
	$pattern[23] = '/(.*?)/';
	$replace[23] = '';
	$pattern[24] = '/(.*?)/';
	$replace[24] = '';
	$pattern[25] = '/(.*?)/';
	$replace[25] = '';
	$pattern[26] = '/&lt;b&gt;/';
	$replace[26] = '&lt;strong&gt;';
	$pattern[27] = '//';
	$replace[27] = '&lt;/strong&gt;';
	$pattern[28] = '//';
	$replace[28] = '';
	$pattern[29] = '//';
	$replace[29] = '';
	$pattern[30] = '/(.*?)/';
	$replace[30] = '';
	$pattern[31] = '//';
	$replace[31] = '';
	$pattern[32] = '/(.*?)/';
	$replace[32] = '';
	$pattern[33] = '/(.*?)/';
	$replace[33] = '';
	$wikipedia = file_get_contents($article);
	$wikipedia = preg_replace($pattern, $replace, $wikipedia);
		if (preg_match("/(.*)/", $wikipedia, $w)) {
			$wikipedia = $w[1];
		} elseif (preg_match("/(.*)&lt;a&gt;/is", $wikipedia, $w)) {
			$wikipedia = $w[1];
		} elseif (preg_match("/(.*)/is", $wikipedia, $w)) {
			$wikipedia = $w[1];
		} elseif (preg_match("/(.*)/is", $wikipedia, $w)) {
			$wikipedia = $w[1];
		}
	$wikipedia=explode('name="References"',$wikipedia);
	$wikipedia=$wikipedia[0];
	$wikipedia=preg_replace('@]*?&#62;.*?@si','',$wikipedia);
	$me=$wikipedia;
	$tags  = 'applet&#124;bgsound&#124;blink&#124;body&#124;button&#124;form&#124;frame&#124;frameset&#124;head&#124;table&#124;tr&#124;td&#124;div&#124;img&#124;h2&#124;h1&#124;ul&#124;a&#124;h3&#124;h4&#124;span&#124;ol&#124;li&#124;dl&#124;strong&#124;i&#124;b&#124;';
	$tags .= 'html&#124;iframe&#124;ilayer&#124;input&#124;keygen&#124;label&#124;layer&#124;link&#124;object&#124;optgroup&#124;option&#124;marquee&#124;';
	$tags .= 'meta&#124;noframes&#124;nolayer&#124;noscript&#124;param&#124;select';
	$attribs  = 'onclick&#124;ondblclick&#124;onmousedown&#124;onmouseup&#124;onmouseover&#124;ondragdrop&#124;';
	$attribs .= 'onmousemove&#124;onmouseout&#124;onkeypress&#124;onkeydown&#124;onkeyup&#124;onabort';
	$regex = array('@]*?&#62;.*?@si', '@]*?&#62;.*?@si', "@]*&#62;@i");
	$me = preg_replace($regex, '', $me);
	$regex = "@('\"\s]*)?(.*?/?&#62;)@i";
	$me = preg_replace($regex, '', $me);
	$regex = "@('\"\s]*)?(.*?/?&#62;)@i";
	while ( preg_match($regex, $me) )
	{
     		$me = preg_replace($regex, '$1$2', $me);
	}
	$regex = '@(\'"\s]*javascript:[^&#62;\'"\s]*&#124;\'[^\']*javascript:[^\']*\'&#124;"[^"]*javascript:[^"]*"))(.*?/?&#62;)@i';
	while ( preg_match($regex, $me) )
	{
     		$me = preg_replace($regex, '$1$2', $me);
	}
	return $me;
}


Call it with the following. (If you connect this to a db, make sure to add a sleep(1); to your loop so Wikipedia doesn't block you.)


$myvar=wikipedia("My Term");
print $myvar;</description>
		<content:encoded><![CDATA[<p>Nice function, I played with it for a while to get what I needed out of it.  My version still isn&#8217;t perfect, but it strips out a lot of stuff that I didn&#8217;t want.</p>
<p>function wikipedia($article)	{<br />
	$article=explode(' ',$article);<br />
	$article=implode('_',$article);<br />
	$article="http://en.wikipedia.org/wiki/".urlencode($article);<br />
	$pattern[0] = '/&lt;a href="(.*?)" rel="nofollow"&gt;(.*?)/';<br />
	$replace[0] = '$2';<br />
	$pattern[1] = '/From Wikipedia, the free encyclopedia/';<br />
	$replace[1] = '';<br />
	$pattern[2] = '/(.*?)Jump to: navigation, search/';<br />
	$replace[2] = '';<br />
	$pattern[3] = '/(.*?)/';<br />
	$replace[3] = '';<br />
	$pattern[4] = '/(.*?)/';<br />
	$replace[4] = '';<br />
	$pattern[5] = '/(.*?)/';<br />
	$replace[5] = '';<br />
	$pattern[6] = '/(.*?)/';<br />
	$replace[6] = '$1';<br />
	$pattern[7] = '/(.*?)/';<br />
	$replace[7] = '';<br />
	$pattern[8] = '/(.*?)/';<br />
	$replace[8] = '';<br />
	$pattern[9] = '/(.*?)/';<br />
	$replace[9] = '';<br />
	$pattern[10] = '/(.*?)/';<br />
	$replace[10] = '';<br />
	$pattern[11] = '/(.*?)/';<br />
	$replace[11] = '';<br />
	$pattern[12] = '/(.*?)/';<br />
	$replace[12] = '';<br />
	$pattern[13] = '/(.*?)/';<br />
	$replace[13] = '';<br />
	$pattern[14] = '//';<br />
	$replace[14] = '';<br />
	$pattern[15] = '/(.*?)/';<br />
	$replace[15] = '';<br />
	$pattern[16] = '/(.*?)/';<br />
	$replace[16] = '';<br />
	$pattern[17] = '/(.*?)/';<br />
	$replace[17] = '';<br />
	$pattern[18] = '//';<br />
	$replace[18] = '';<br />
	$pattern[19] = '/\[(.*?)\]/';<br />
	$replace[19] = '';<br />
	$pattern[20] = '/(.*?)/';<br />
	$replace[20] = '';<br />
	$pattern[21] = '//';<br />
	$replace[21] = '';<br />
	$pattern[22] = '/(.*?)/';<br />
	$replace[22] = '';<br />
	$pattern[23] = '/(.*?)/';<br />
	$replace[23] = '';<br />
	$pattern[24] = '/(.*?)/';<br />
	$replace[24] = '';<br />
	$pattern[25] = '/(.*?)/';<br />
	$replace[25] = '';<br />
	$pattern[26] = '/&lt;b&gt;/';<br />
	$replace[26] = '&lt;strong&gt;';<br />
	$pattern[27] = '//';<br />
	$replace[27] = '&lt;/strong&gt;';<br />
	$pattern[28] = '//';<br />
	$replace[28] = '';<br />
	$pattern[29] = '//';<br />
	$replace[29] = '';<br />
	$pattern[30] = '/(.*?)/';<br />
	$replace[30] = '';<br />
	$pattern[31] = '//';<br />
	$replace[31] = '';<br />
	$pattern[32] = '/(.*?)/';<br />
	$replace[32] = '';<br />
	$pattern[33] = '/(.*?)/';<br />
	$replace[33] = '';<br />
	$wikipedia = file_get_contents($article);<br />
	$wikipedia = preg_replace($pattern, $replace, $wikipedia);<br />
		if (preg_match("/(.*)/", $wikipedia, $w)) {<br />
			$wikipedia = $w[1];<br />
		} elseif (preg_match("/(.*)&lt;a&gt;/is", $wikipedia, $w)) {<br />
			$wikipedia = $w[1];<br />
		} elseif (preg_match("/(.*)/is", $wikipedia, $w)) {<br />
			$wikipedia = $w[1];<br />
		} elseif (preg_match("/(.*)/is", $wikipedia, $w)) {<br />
			$wikipedia = $w[1];<br />
		}<br />
	$wikipedia=explode('name="References"',$wikipedia);<br />
	$wikipedia=$wikipedia[0];<br />
	$wikipedia=preg_replace('@]*?&gt;.*?@si','',$wikipedia);<br />
	$me=$wikipedia;<br />
	$tags  = 'applet|bgsound|blink|body|button|form|frame|frameset|head|table|tr|td|div|img|h2|h1|ul|a|h3|h4|span|ol|li|dl|strong|i|b|';<br />
	$tags .= 'html|iframe|ilayer|input|keygen|label|layer|link|object|optgroup|option|marquee|';<br />
	$tags .= 'meta|noframes|nolayer|noscript|param|select';<br />
	$attribs  = 'onclick|ondblclick|onmousedown|onmouseup|onmouseover|ondragdrop|';<br />
	$attribs .= 'onmousemove|onmouseout|onkeypress|onkeydown|onkeyup|onabort';<br />
	$regex = array('@]*?&gt;.*?@si', '@]*?&gt;.*?@si', "@]*&gt;@i");<br />
	$me = preg_replace($regex, '', $me);<br />
	$regex = "@('\"\s]*)?(.*?/?&gt;)@i";<br />
	$me = preg_replace($regex, '', $me);<br />
	$regex = "@('\"\s]*)?(.*?/?&gt;)@i";<br />
	while ( preg_match($regex, $me) )<br />
	{<br />
     		$me = preg_replace($regex, '$1$2', $me);<br />
	}<br />
	$regex = '@(\'"\s]*javascript:[^&gt;\'"\s]*|\'[^\']*javascript:[^\']*\'|"[^"]*javascript:[^"]*"))(.*?/?&gt;)@i';<br />
	while ( preg_match($regex, $me) )<br />
	{<br />
     		$me = preg_replace($regex, '$1$2', $me);<br />
	}<br />
	return $me;<br />
}</p>
<p>Call it with the following. (If you connect this to a db, make sure to add a sleep(1); to your loop so Wikipedia doesn&#8217;t block you.)</p>
<p>$myvar=wikipedia("My Term");<br />
print $myvar;</p>
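<p>A minimal sketch of the throttled loop mentioned above, assuming the terms come from somewhere like a db query. The helper name scrape_all and its parameters are made up for illustration; the $fetch argument stands in for the wikipedia() function, and a stand-in callable is used here so the sketch runs without touching the network.</p>

```php
<?php
// Hypothetical helper: run $fetch on each term, pausing between requests
// so the remote site is not hit in a tight loop.
function scrape_all(array $terms, callable $fetch, int $delaySeconds = 1): array
{
    $results = [];
    $remaining = count($terms);
    foreach ($terms as $term) {
        $results[$term] = $fetch($term);
        $remaining--;
        // No need to wait after the final request.
        if ($remaining > 0 && $delaySeconds > 0) {
            sleep($delaySeconds);
        }
    }
    return $results;
}

// Stand-in callable (strtoupper) and a zero delay, just to show the shape
// of the result; in real use pass 'wikipedia' and keep the default delay.
$pages = scrape_all(['My Term', 'Other Term'], 'strtoupper', 0);
print_r($pages);
```

<p>The result is keyed by term, so each scraped article can be stored next to the term that produced it.</p>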
]]></content:encoded>
	</item>
	<item>
		<title>By: Paulo</title>
		<link>http://blackhatseo-blog.com/wikipedia-scraper#comment-17488</link>
		<dc:creator>Paulo</dc:creator>
		<pubDate>Mon, 28 Jul 2008 02:48:39 +0000</pubDate>
		<guid isPermaLink="false">http://blackhatseo-blog.com/?p=8#comment-17488</guid>
		<description>Some web hosts (such as DreamHost) disable remote URLs in fopen and file_get_contents (the allow_url_fopen setting) on their servers.  To get around that, use curl instead:

REPLACE:
$wikipedia = fopen($article, "r");

WITH:
$ch = curl_init();
$timeout = 5; // connect timeout in seconds; 0 would leave a very long or unlimited wait
curl_setopt ($ch, CURLOPT_URL, $article);
curl_setopt ($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt ($ch, CURLOPT_CONNECTTIMEOUT, $timeout);
$wikipedia = curl_exec($ch);
curl_close($ch);</description>
		<content:encoded><![CDATA[<p>Some web hosts (such as DreamHost) disable remote URLs in fopen and file_get_contents (the allow_url_fopen setting) on their servers.  To get around that, use curl instead:</p>
<p>REPLACE:<br />
$wikipedia = fopen($article, "r");</p>
<p>WITH:<br />
$ch = curl_init();<br />
$timeout = 5; // connect timeout in seconds; 0 would leave a very long or unlimited wait<br />
curl_setopt ($ch, CURLOPT_URL, $article);<br />
curl_setopt ($ch, CURLOPT_RETURNTRANSFER, 1);<br />
curl_setopt ($ch, CURLOPT_CONNECTTIMEOUT, $timeout);<br />
$wikipedia = curl_exec($ch);<br />
curl_close($ch);</p>
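<p>A slightly fuller sketch of the same curl approach, wrapped in a function that also checks the HTTP status so an error page is not mistaken for article content. The function name fetch_with_curl, the 10 second default, and the user agent string are assumptions for illustration, not part of the original script.</p>

```php
<?php
// Hypothetical helper: fetch a URL with curl and return the body,
// or false if the request fails or the status is not 200.
function fetch_with_curl(string $url, int $timeoutSeconds = 10)
{
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $url);
    // Return the response as a string instead of printing it.
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    // Follow redirects, e.g. from a redirected article title.
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
    // Bound both the connect phase and the whole transfer.
    curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, $timeoutSeconds);
    curl_setopt($ch, CURLOPT_TIMEOUT, $timeoutSeconds);
    // Placeholder user agent; replace with something describing your script.
    curl_setopt($ch, CURLOPT_USERAGENT, 'ExampleScraper/0.1');
    $body = curl_exec($ch);
    $status = curl_getinfo($ch, CURLINFO_HTTP_CODE);
    // The handle is released when $ch goes out of scope.
    if ($body === false || $status !== 200) {
        return false;
    }
    return $body;
}
```

<p>In the script this would replace the fetch line, e.g. $wikipedia = fetch_with_curl($article); followed by a check for false before running the regexes.</p>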
]]></content:encoded>
	</item>
	<item>
		<title>By: handsomemans</title>
		<link>http://blackhatseo-blog.com/wikipedia-scraper#comment-9363</link>
		<dc:creator>handsomemans</dc:creator>
		<pubDate>Mon, 14 Apr 2008 19:56:14 +0000</pubDate>
		<guid isPermaLink="false">http://blackhatseo-blog.com/?p=8#comment-9363</guid>
		<description>[quote comment=""]Thank you so much[/quote]</description>
		<content:encoded><![CDATA[<blockquote cite="http://blackhatseo-blog.com/wikipedia-scraper#comment-"><p>
Thank you so much</p>
</blockquote>
]]></content:encoded>
	</item>
	<item>
		<title>By: Dan</title>
		<link>http://blackhatseo-blog.com/wikipedia-scraper#comment-17</link>
		<dc:creator>Dan</dc:creator>
		<pubDate>Fri, 28 Dec 2007 01:42:20 +0000</pubDate>
		<guid isPermaLink="false">http://blackhatseo-blog.com/?p=8#comment-17</guid>
		<description>This doesn't really work....</description>
		<content:encoded><![CDATA[<p>This doesn&#8217;t really work&#8230;.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: masterofpuppets</title>
		<link>http://blackhatseo-blog.com/wikipedia-scraper#comment-19</link>
		<dc:creator>masterofpuppets</dc:creator>
		<pubDate>Sat, 20 Oct 2007 19:28:04 +0000</pubDate>
		<guid isPermaLink="false">http://blackhatseo-blog.com/?p=8#comment-19</guid>
		<description>Is there a standalone Flickr scraper hook available, like the YouTube example here? Thanks for contributing.</description>
		<content:encoded><![CDATA[<p>Is there a standalone Flickr scraper hook available, like the YouTube example here? Thanks for contributing.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: masterofpuppets</title>
		<link>http://blackhatseo-blog.com/wikipedia-scraper#comment-18</link>
		<dc:creator>masterofpuppets</dc:creator>
		<pubDate>Sat, 20 Oct 2007 19:25:12 +0000</pubDate>
		<guid isPermaLink="false">http://blackhatseo-blog.com/?p=8#comment-18</guid>
		<description>The code gives this error:

Parse error: syntax error, unexpected '}' in wikipedia.php on line 85</description>
		<content:encoded><![CDATA[<p>The code gives this error:</p>
<p>Parse error: syntax error, unexpected &#8216;}&#8217; in wikipedia.php on line 85</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: cashflowrusty</title>
		<link>http://blackhatseo-blog.com/wikipedia-scraper#comment-16</link>
		<dc:creator>cashflowrusty</dc:creator>
		<pubDate>Mon, 30 Jul 2007 03:46:25 +0000</pubDate>
		<guid isPermaLink="false">http://blackhatseo-blog.com/?p=8#comment-16</guid>
		<description>One note: I am using PHP5 and I get the same "Resource ID #1" error with the script as it currently is. I changed the line:
&lt;code&gt;
$wikipedia = fopen($article, "r");
&lt;/code&gt;
to:
&lt;code&gt;
$wikipedia = file_get_contents($article);
&lt;/code&gt;
It works, but if you do an inexact term search you could get a 403 error page as 'content' so there is still an issue doing it this way...</description>
		<content:encoded><![CDATA[<p>One note: I am using PHP5 and I get the same &#8220;Resource ID #1&#8221; error with the script as it currently is. I changed the line:<br />
<code><br />
$wikipedia = fopen($article, "r");<br />
</code><br />
to:<br />
<code><br />
$wikipedia = file_get_contents($article);<br />
</code><br />
It works, but if you do an inexact term search you could get a 403 error page as &#8216;content&#8217; so there is still an issue doing it this way&#8230;</p>
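<p>One way to guard against the error-page problem described above is to look at the HTTP status before using the body. This is only a sketch: the names http_status and fetch_article are made up for illustration, and it assumes file_get_contents is allowed to open URLs on your host.</p>

```php
<?php
// Hypothetical helper: pull the numeric status out of response header lines
// such as "HTTP/1.1 403 Forbidden". With redirects there can be several
// status lines, so the last one wins. Returns 0 if none is found.
function http_status(array $headers): int
{
    $status = 0;
    foreach ($headers as $line) {
        if (preg_match('#^HTTP/\S+\s+(\d{3})#', $line, $m)) {
            $status = (int) $m[1];
        }
    }
    return $status;
}

// Hypothetical wrapper around file_get_contents that rejects error pages.
function fetch_article(string $url)
{
    // ignore_errors makes PHP hand back the body even for 4xx/5xx
    // responses, so the status has to be checked by hand.
    $context = stream_context_create(['http' => ['ignore_errors' => true]]);
    $body = @file_get_contents($url, false, $context);
    // PHP's HTTP wrapper fills $http_response_header in the calling scope.
    $headers = isset($http_response_header) ? $http_response_header : [];
    if ($body === false || http_status($headers) !== 200) {
        return false;
    }
    return $body;
}
```

<p>With this in place, $wikipedia = fetch_article($article); gives false for a 403 page instead of treating the error page as content.</p>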
]]></content:encoded>
	</item>
	<item>
		<title>By: killertux</title>
		<link>http://blackhatseo-blog.com/wikipedia-scraper#comment-15</link>
		<dc:creator>killertux</dc:creator>
		<pubDate>Wed, 11 Apr 2007 23:19:50 +0000</pubDate>
		<guid isPermaLink="false">http://blackhatseo-blog.com/?p=8#comment-15</guid>
		<description>Interesting, but their demo looks a little rough.  I have created one myself at http://www.killertux.com/node/44 .  It features caching, the option to not display images, and a text-only version with no links.  It is available with source code on my site.</description>
		<content:encoded><![CDATA[<p>Interesting, but their demo looks a little rough.  I have created one myself at <a href="http://www.killertux.com/node/44" rel="nofollow">http://www.killertux.com/node/44</a> .  It features caching, the option to not display images, and a text-only version with no links.  It is available with source code on my site.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: vbignacio</title>
		<link>http://blackhatseo-blog.com/wikipedia-scraper#comment-14</link>
		<dc:creator>vbignacio</dc:creator>
		<pubDate>Tue, 03 Apr 2007 02:41:18 +0000</pubDate>
		<guid isPermaLink="false">http://blackhatseo-blog.com/?p=8#comment-14</guid>
		<description>I tried it and this is what came up:

Resource id #1</description>
		<content:encoded><![CDATA[<p>I tried it and this is what came up:</p>
<p>Resource id #1</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Sinned</title>
		<link>http://blackhatseo-blog.com/wikipedia-scraper#comment-13</link>
		<dc:creator>Sinned</dc:creator>
		<pubDate>Thu, 29 Mar 2007 09:45:03 +0000</pubDate>
		<guid isPermaLink="false">http://blackhatseo-blog.com/?p=8#comment-13</guid>
		<description>One more question...where do you input the "keyword" or "topic" you want the script to scrape?</description>
		<content:encoded><![CDATA[<p>One more question&#8230;where do you input the &#8220;keyword&#8221; or &#8220;topic&#8221; you want the script to scrape?</p>
]]></content:encoded>
	</item>
</channel>
</rss>
