<?xml version="1.0" encoding="UTF-8"?><rss version="2.0" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:sy="http://purl.org/rss/1.0/modules/syndication/" > <channel><title>Comments on: Wikipedia Scraper</title> <atom:link href="http://blackhatseo-blog.com/wikipedia-scraper/feed" rel="self" type="application/rss+xml" /><link>http://blackhatseo-blog.com/wikipedia-scraper</link> <description>spam 2.0</description> <lastBuildDate>Mon, 14 Feb 2011 09:57:36 +0000</lastBuildDate> <sy:updatePeriod>hourly</sy:updatePeriod> <sy:updateFrequency>1</sy:updateFrequency> <generator>http://wordpress.org/?v=3.1.2</generator> <item><title>By: Tom</title><link>http://blackhatseo-blog.com/wikipedia-scraper/comment-page-1#comment-23511</link> <dc:creator>Tom</dc:creator> <pubDate>Wed, 01 Oct 2008 04:25:51 +0000</pubDate> <guid isPermaLink="false">http://blackhatseo-blog.com/?p=8#comment-23511</guid> <description>Nice function, I played with it for awhile to get what I needed out of it.  
My version still isn&#039;t perfect, but it strips a lot of stuff that I didn&#039;t want out.function wikipedia($article)	{ $article=explode(&#039; &#039;,$article); $article=implode(&#039;_&#039;,$article); $article=&quot;http://en.wikipedia.org/wiki/&quot;.urlencode($article); $pattern[0] = &#039;/&lt;a href=&quot;(.*?)&quot; rel=&quot;nofollow&quot;&gt;(.*?)/&#039;; $replace[0] = &#039;$2&#039;; $pattern[1] = &#039;/From Wikipedia, the free encyclopedia/&#039;; $replace[1] = &#039;&#039;; $pattern[2] = &#039;/(.*?)Jump to: navigation, search/&#039;; $replace[2] = &#039;&#039;; $pattern[3] = &#039;/(.*?)/&#039;; $replace[3] = &#039;&#039;; $pattern[4] = &#039;/(.*?)/&#039;; $replace[4] = &#039;&#039;; $pattern[5] = &#039;/(.*?)/&#039;; $replace[5] = &#039;&#039;; $pattern[6] = &#039;/(.*?)/&#039;; $replace[6] = &#039;$1&#039;; $pattern[7] = &#039;/(.*?)/&#039;; $replace[7] = &#039;&#039;; $pattern[8] = &#039;/(.*?)/&#039;; $replace[8] = &#039;&#039;; $pattern[9] = &#039;/(.*?)/&#039;; $replace[9] = &#039;&#039;; $pattern[10] = &#039;/(.*?)/&#039;; $replace[10] = &#039;&#039;; $pattern[11] = &#039;/(.*?)/&#039;; $replace[11] = &#039;&#039;; $pattern[12] = &#039;/(.*?)/&#039;; $replace[12] = &#039;&#039;; $pattern[13] = &#039;/(.*?)/&#039;; $replace[13] = &#039;&#039;; $pattern[14] = &#039;//&#039;; $replace[14] = &#039;&#039;; $pattern[15] = &#039;/(.*?)/&#039;; $replace[15] = &#039;&#039;; $pattern[16] = &#039;/(.*?)/&#039;; $replace[16] = &#039;&#039;; $pattern[17] = &#039;/(.*?)/&#039;; $replace[17] = &#039;&#039;; $pattern[18] = &#039;//&#039;; $replace[18] = &#039;&#039;; $pattern[19] = &#039;/\[(.*?)\]/&#039;; $replace[19] = &#039;&#039;; $pattern[20] = &#039;/(.*?)/&#039;; $replace[20] = &#039;&#039;; $pattern[21] = &#039;//&#039;; $replace[21] = &#039;&#039;; $pattern[22] = &#039;/(.*?)/&#039;; $replace[22] = &#039;&#039;; $pattern[23] = &#039;/(.*?)/&#039;; $replace[23] = &#039;&#039;; $pattern[24] = &#039;/(.*?)/&#039;; $replace[24] = &#039;&#039;; 
$pattern[25] = &#039;/(.*?)/&#039;; $replace[25] = &#039;&#039;; $pattern[26] = &#039;/&lt;b&gt;/&#039;; $replace[26] = &#039;&lt;strong&gt;&#039;; $pattern[27] = &#039;//&#039;; $replace[27] = &#039;&lt;/strong&gt;&#039;; $pattern[28] = &#039;//&#039;; $replace[28] = &#039;&#039;; $pattern[29] = &#039;//&#039;; $replace[29] = &#039;&#039;; $pattern[30] = &#039;/(.*?)/&#039;; $replace[30] = &#039;&#039;; $pattern[31] = &#039;//&#039;; $replace[31] = &#039;&#039;; $pattern[32] = &#039;/(.*?)/&#039;; $replace[32] = &#039;&#039;; $pattern[33] = &#039;/(.*?)/&#039;; $replace[33] = &#039;&#039;; $wikipedia = file_get_contents($article, &quot;r&quot;); $wikipedia = preg_replace($pattern, $replace, $wikipedia); if (preg_match(&quot;/(.*)/&quot;, $wikipedia, $w)) { $wikipedia = $w[1]; } elseif (preg_match(&quot;/(.*)&lt;a&gt;/is&quot;, $wikipedia, $w)) { $wikipedia = $w[1]; } elseif (preg_match(&quot;/(.*)/is&quot;, $wikipedia, $w)) { $wikipedia = $w[1]; } elseif (preg_match(&quot;/(.*)/is&quot;, $wikipedia, $w)) { $wikipedia = $w[1]; } $wikipedia=explode(&#039;name=&quot;References&quot;&#039;,$wikipedia); $wikipedia=$wikipedia[0]; $wikipedia=preg_replace(&#039;@]*?&gt;.*?@si&#039;,&#039;&#039;,$wikipedia); $me=$wikipedia; $tags  = &#039;applet&#124;bgsound&#124;blink&#124;body&#124;button&#124;form&#124;frame&#124;frameset&#124;head&#124;table&#124;tr&#124;td&#124;div&#124;img&#124;h2&#124;h1&#124;ul&#124;a&#124;h3&#124;h4&#124;span&#124;ol&#124;li&#124;dl&#124;strong&#124;i&#124;b&#124;&#039;; $tags .= &#039;html&#124;iframe&#124;ilayer&#124;input&#124;keygen&#124;label&#124;layer&#124;link&#124;object&#124;optgroup&#124;option&#124;marquee&#124;&#039;; $tags .= &#039;meta&#124;noframes&#124;nolayer&#124;noscript&#124;param&#124;select&#039;; $attribs  = &#039;onclick&#124;ondblclick&#124;onmousedown&#124;onmouseup&#124;onmouseover&#124;ondragdrop&#124;&#039;; $attribs .= &#039;onmousemove&#124;onmouseout&#124;onkeypress&#124;onkeydown&#124;onkeyup&#124;onabort&#039;; 
$regex = array(&#039;@]*?&gt;.*?@si&#039;, &#039;@]*?&gt;.*?@si&#039;, &quot;@]*&gt;@i&quot;); $me = preg_replace($regex, &#039;&#039;, $me); $regex = &quot;@(&#039;\&quot;\s]*)?(.*?/?&gt;)@i&quot;; $me = preg_replace($regex, &#039;&#039;, $me); $regex = &quot;@(&#039;\&quot;\s]*)?(.*?/?&gt;)@i&quot;; while ( preg_match($regex, $me) ) { $me = preg_replace($regex, &#039;$1$2&#039;, $me); } $regex = &#039;@(\&#039;&quot;\s]*javascript:[^&gt;\&#039;&quot;\s]*&#124;\&#039;[^\&#039;]*javascript:[^\&#039;]*\&#039;&#124;&quot;[^&quot;]*javascript:[^&quot;]*&quot;))(.*?/?&gt;)@i&#039;; while ( preg_match($regex, $me) ) { $me = preg_replace($regex, &#039;$1$2&#039;, $me); } return $me; }Call it with: (If you connect this to a db, make sure to add a sleep(1); into your loop so wikipedia doesn&#039;t block you.$myvar=wikipedia(&quot;My Term&quot;); print $myvar;</description> <content:encoded><![CDATA[<p>Nice function, I played with it for awhile to get what I needed out of it.  My version still isn&#8217;t perfect, but it strips a lot of stuff that I didn&#8217;t want out.</p><p>function wikipedia($article)	{<br /> $article=explode(&#8216; &#8216;,$article);<br /> $article=implode(&#8216;_&#8217;,$article);<br /> $article=&#8221;http://en.wikipedia.org/wiki/&#8221;.urlencode($article);<br /> $pattern[0] = &#8216;/<a href="(.*?)" rel="nofollow">(.*?)/&#8217;;<br /> $replace[0] = &#8216;$2&#8242;;<br /> $pattern[1] = &#8216;/From Wikipedia, the free encyclopedia/&#8217;;<br /> $replace[1] = &#8221;;<br /> $pattern[2] = &#8216;/(.*?)Jump to: navigation, search/&#8217;;<br /> $replace[2] = &#8221;;<br /> $pattern[3] = &#8216;/(.*?)/&#8217;;<br /> $replace[3] = &#8221;;<br /> $pattern[4] = &#8216;/(.*?)/&#8217;;<br /> $replace[4] = &#8221;;<br /> $pattern[5] = &#8216;/(.*?)/&#8217;;<br /> $replace[5] = &#8221;;<br /> $pattern[6] = &#8216;/(.*?)/&#8217;;<br /> $replace[6] = &#8216;$1&#8242;;<br /> $pattern[7] = &#8216;/(.*?)/&#8217;;<br /> $replace[7] = &#8221;;<br /> $pattern[8] 
= &#8216;/(.*?)/&#8217;;<br /> $replace[8] = &#8221;;<br /> $pattern[9] = &#8216;/(.*?)/&#8217;;<br /> $replace[9] = &#8221;;<br /> $pattern[10] = &#8216;/(.*?)/&#8217;;<br /> $replace[10] = &#8221;;<br /> $pattern[11] = &#8216;/(.*?)/&#8217;;<br /> $replace[11] = &#8221;;<br /> $pattern[12] = &#8216;/(.*?)/&#8217;;<br /> $replace[12] = &#8221;;<br /> $pattern[13] = &#8216;/(.*?)/&#8217;;<br /> $replace[13] = &#8221;;<br /> $pattern[14] = &#8216;//&#8217;;<br /> $replace[14] = &#8221;;<br /> $pattern[15] = &#8216;/(.*?)/&#8217;;<br /> $replace[15] = &#8221;;<br /> $pattern[16] = &#8216;/(.*?)/&#8217;;<br /> $replace[16] = &#8221;;<br /> $pattern[17] = &#8216;/(.*?)/&#8217;;<br /> $replace[17] = &#8221;;<br /> $pattern[18] = &#8216;//&#8217;;<br /> $replace[18] = &#8221;;<br /> $pattern[19] = &#8216;/\[(.*?)\]/&#8217;;<br /> $replace[19] = &#8221;;<br /> $pattern[20] = &#8216;/(.*?)/&#8217;;<br /> $replace[20] = &#8221;;<br /> $pattern[21] = &#8216;//&#8217;;<br /> $replace[21] = &#8221;;<br /> $pattern[22] = &#8216;/(.*?)/&#8217;;<br /> $replace[22] = &#8221;;<br /> $pattern[23] = &#8216;/(.*?)/&#8217;;<br /> $replace[23] = &#8221;;<br /> $pattern[24] = &#8216;/(.*?)/&#8217;;<br /> $replace[24] = &#8221;;<br /> $pattern[25] = &#8216;/(.*?)/&#8217;;<br /> $replace[25] = &#8221;;<br /> $pattern[26] = &#8216;/<b>/&#8217;;<br /> $replace[26] = &#8216;<strong>&#8216;;<br /> $pattern[27] = &#8216;//&#8217;;<br /> $replace[27] = &#8216;</strong>&#8216;;<br /> $pattern[28] = &#8216;//&#8217;;<br /> $replace[28] = &#8221;;<br /> $pattern[29] = &#8216;//&#8217;;<br /> $replace[29] = &#8221;;<br /> $pattern[30] = &#8216;/(.*?)/&#8217;;<br /> $replace[30] = &#8221;;<br /> $pattern[31] = &#8216;//&#8217;;<br /> $replace[31] = &#8221;;<br /> $pattern[32] = &#8216;/(.*?)/&#8217;;<br /> $replace[32] = &#8221;;<br /> $pattern[33] = &#8216;/(.*?)/&#8217;;<br /> $replace[33] = &#8221;;<br /> $wikipedia = file_get_contents($article, &#8220;r&#8221;);<br /> $wikipedia = 
preg_replace($pattern, $replace, $wikipedia);<br /> if (preg_match(&#8220;/(.*)/&#8221;, $wikipedia, $w)) {<br /> $wikipedia = $w[1];<br /> } elseif (preg_match(&#8220;/(.*)<a>/is&#8221;, $wikipedia, $w)) {<br /> $wikipedia = $w[1];<br /> } elseif (preg_match(&#8220;/(.*)/is&#8221;, $wikipedia, $w)) {<br /> $wikipedia = $w[1];<br /> } elseif (preg_match(&#8220;/(.*)/is&#8221;, $wikipedia, $w)) {<br /> $wikipedia = $w[1];<br /> }<br /> $wikipedia=explode(&#8216;name=&#8221;References&#8221;&#8216;,$wikipedia);<br /> $wikipedia=$wikipedia[0];<br /> $wikipedia=preg_replace(&#8216;@]*?&gt;.*?@si&#8217;,&#8221;,$wikipedia);<br /> $me=$wikipedia;<br /> $tags  = &#8216;applet|bgsound|blink|body|button|form|frame|frameset|head|table|tr|td|div|img|h2|h1|ul|a|h3|h4|span|ol|li|dl|strong|i|b|&#8217;;<br /> $tags .= &#8216;html|iframe|ilayer|input|keygen|label|layer|link|object|optgroup|option|marquee|&#8217;;<br /> $tags .= &#8216;meta|noframes|nolayer|noscript|param|select&#8217;;<br /> $attribs  = &#8216;onclick|ondblclick|onmousedown|onmouseup|onmouseover|ondragdrop|&#8217;;<br /> $attribs .= &#8216;onmousemove|onmouseout|onkeypress|onkeydown|onkeyup|onabort&#8217;;<br /> $regex = array(&#8216;@]*?&gt;.*?@si&#8217;, &#8216;@]*?&gt;.*?@si&#8217;, &#8220;@]*&gt;@i&#8221;);<br /> $me = preg_replace($regex, &#8221;, $me);<br /> $regex = &#8220;@(&#8216;\&#8221;\s]*)?(.*?/?&gt;)@i&#8221;;<br /> $me = preg_replace($regex, &#8221;, $me);<br /> $regex = &#8220;@(&#8216;\&#8221;\s]*)?(.*?/?&gt;)@i&#8221;;<br /> while ( preg_match($regex, $me) )<br /> {<br /> $me = preg_replace($regex, &#8216;$1$2&#8242;, $me);<br /> }<br /> $regex = &#8216;@(\&#8217;&#8221;\s]*javascript:[^&gt;\'"\s]*|\&#8217;[^\']*javascript:[^\']*\&#8217;|&#8221;[^"]*javascript:[^"]*&#8221;))(.*?/?&gt;)@i&#8217;;<br /> while ( preg_match($regex, $me) )<br /> {<br /> $me = preg_replace($regex, &#8216;$1$2&#8242;, $me);<br /> }<br /> return $me;<br /> }</a></b></a></p><p>Call it with: (If you connect this to a db, 
make sure to add a sleep(1); to your loop so Wikipedia doesn&#8217;t block you.)</p><p>$myvar=wikipedia("My Term");<br /> print $myvar;</p> ]]></content:encoded> </item> <item><title>By: Paulo</title><link>http://blackhatseo-blog.com/wikipedia-scraper/comment-page-1#comment-17488</link> <dc:creator>Paulo</dc:creator> <pubDate>Mon, 28 Jul 2008 02:48:39 +0000</pubDate> <guid isPermaLink="false">http://blackhatseo-blog.com/?p=8#comment-17488</guid> <description>Some web hosts (such as DreamHost) disable URL access through fopen and file_get_contents (the allow_url_fopen setting) on their servers.  To get around that, use cURL instead: REPLACE: $wikipedia = fopen($article, &quot;r&quot;); WITH: $ch = curl_init(); $timeout = 0; curl_setopt($ch, CURLOPT_URL, $article); curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1); curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, $timeout); $wikipedia = curl_exec($ch); curl_close($ch);</description> <content:encoded><![CDATA[<p>Some web hosts (such as DreamHost) disable URL access through fopen and file_get_contents (the allow_url_fopen setting) on their servers.  
To get around that, use cURL instead:</p><p>REPLACE:<br /> $wikipedia = fopen($article, "r");</p><p>WITH:<br /> $ch = curl_init();<br /> $timeout = 0;<br /> curl_setopt($ch, CURLOPT_URL, $article);<br /> curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);<br /> curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, $timeout);<br /> $wikipedia = curl_exec($ch);<br /> curl_close($ch);</p> ]]></content:encoded> </item> <item><title>By: handsomemans</title><link>http://blackhatseo-blog.com/wikipedia-scraper/comment-page-1#comment-9363</link> <dc:creator>handsomemans</dc:creator> <pubDate>Mon, 14 Apr 2008 19:56:14 +0000</pubDate> <guid isPermaLink="false">http://blackhatseo-blog.com/?p=8#comment-9363</guid> <description>Thank you so much</description> <content:encoded><![CDATA[<blockquote cite="http://blackhatseo-blog.com/wikipedia-scraper#comment-"><p> Thank you so much</p></blockquote> ]]></content:encoded> </item> <item><title>By: Dan</title><link>http://blackhatseo-blog.com/wikipedia-scraper/comment-page-1#comment-17</link> <dc:creator>Dan</dc:creator> <pubDate>Fri, 28 Dec 2007 01:42:20 +0000</pubDate> <guid isPermaLink="false">http://blackhatseo-blog.com/?p=8#comment-17</guid> <description>This doesn&#039;t really work....</description> <content:encoded><![CDATA[<p>This doesn&#8217;t really work&#8230;.</p> ]]></content:encoded> </item> <item><title>By: masterofpuppets</title><link>http://blackhatseo-blog.com/wikipedia-scraper/comment-page-1#comment-19</link> <dc:creator>masterofpuppets</dc:creator> <pubDate>Sat, 20 Oct 2007 19:28:04 +0000</pubDate> <guid isPermaLink="false">http://blackhatseo-blog.com/?p=8#comment-19</guid> <description>Is there a standalone Flickr scraper hook available, like the YouTube example here? Thanks for contributing.</description> <content:encoded><![CDATA[<p>Is there a standalone Flickr scraper hook available, like the YouTube example here? 
Thanks for contributing.</p> ]]></content:encoded> </item> <item><title>By: masterofpuppets</title><link>http://blackhatseo-blog.com/wikipedia-scraper/comment-page-1#comment-18</link> <dc:creator>masterofpuppets</dc:creator> <pubDate>Sat, 20 Oct 2007 19:25:12 +0000</pubDate> <guid isPermaLink="false">http://blackhatseo-blog.com/?p=8#comment-18</guid> <description>The code gives this error: Parse error: syntax error, unexpected &#039;}&#039; in wikipedia.php on line 85</description> <content:encoded><![CDATA[<p>The code gives this error:</p><p>Parse error: syntax error, unexpected '}' in wikipedia.php on line 85</p> ]]></content:encoded> </item> <item><title>By: cashflowrusty</title><link>http://blackhatseo-blog.com/wikipedia-scraper/comment-page-1#comment-16</link> <dc:creator>cashflowrusty</dc:creator> <pubDate>Mon, 30 Jul 2007 03:46:25 +0000</pubDate> <guid isPermaLink="false">http://blackhatseo-blog.com/?p=8#comment-16</guid> <description>One note: I am using PHP 5 and I get the same &quot;Resource ID #1&quot; error with the script as it currently is. I changed the line: &lt;code&gt; $wikipedia = fopen($article, &quot;r&quot;); &lt;/code&gt; to: &lt;code&gt; $wikipedia = file_get_contents($article); &lt;/code&gt; It works, but if you do an inexact term search you could get a 403 error page as &#039;content&#039;, so there is still an issue doing it this way...</description> <content:encoded><![CDATA[<p>One note: I am using PHP 5 and I get the same &#8220;Resource ID #1&#8221; error with the script as it currently is. 
I changed the line:<br /> <code><br /> $wikipedia = fopen($article, "r");<br /> </code><br /> to:<br /> <code><br /> $wikipedia = file_get_contents($article);<br /> </code><br /> It works, but if you do an inexact term search you could get a 403 error page as &#8216;content&#8217;, so there is still an issue doing it this way&#8230;</p> ]]></content:encoded> </item> <item><title>By: killertux</title><link>http://blackhatseo-blog.com/wikipedia-scraper/comment-page-1#comment-15</link> <dc:creator>killertux</dc:creator> <pubDate>Wed, 11 Apr 2007 23:19:50 +0000</pubDate> <guid isPermaLink="false">http://blackhatseo-blog.com/?p=8#comment-15</guid> <description>Interesting, but their demo looks a little rough.  I have created one myself at http://www.killertux.com/node/44 .  It features caching, the option to not display images, and a text-only version with no links.  It is available with source code on my site.</description> <content:encoded><![CDATA[<p>Interesting, but their demo looks a little rough.  I have created one myself at <a href="http://www.killertux.com/node/44" rel="nofollow">http://www.killertux.com/node/44</a>.  It features caching, the option to not display images, and a text-only version with no links.  
It is available with source code on my site.</p> ]]></content:encoded> </item> <item><title>By: vbignacio</title><link>http://blackhatseo-blog.com/wikipedia-scraper/comment-page-1#comment-14</link> <dc:creator>vbignacio</dc:creator> <pubDate>Tue, 03 Apr 2007 02:41:18 +0000</pubDate> <guid isPermaLink="false">http://blackhatseo-blog.com/?p=8#comment-14</guid> <description>I tried it and this is what came up: Resource id #1</description> <content:encoded><![CDATA[<p>I tried it and this is what came up:</p><p>Resource id #1</p> ]]></content:encoded> </item> <item><title>By: Sinned</title><link>http://blackhatseo-blog.com/wikipedia-scraper/comment-page-1#comment-13</link> <dc:creator>Sinned</dc:creator> <pubDate>Thu, 29 Mar 2007 09:45:03 +0000</pubDate> <guid isPermaLink="false">http://blackhatseo-blog.com/?p=8#comment-13</guid> <description>One more question... where do you input the &quot;keyword&quot; or &quot;topic&quot; you want the script to scrape for?</description> <content:encoded><![CDATA[<p>One more question&#8230; where do you input the &#8220;keyword&#8221; or &#8220;topic&#8221; you want the script to scrape for?</p> ]]></content:encoded> </item> </channel> </rss>
