User:Wherebot/Source

Here is the latest code as of 5/5/2007. As of this writing, it does not work with Unicode.

This has only been tested on UNIX-like systems, but it should theoretically also work on Windows. Note that the code was not intended for wide distribution, so it is not well commented. Sorry! Also note that the code requires wget, pywikipediabot, Yahoo's Python search plugin, Perl, and the Bot::BasicBot and IPC::Open2 Perl modules. You may use the code under the GNU General Public License.

If you want to modify Wherebot to run on a different wiki or language, some changes need to be made. The places most likely to need changing are marked with comments containing the text "#CONFIG".
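
To find every marked line quickly, something like the following could be used. This is only a sketch in Python 3; the function name find_config_lines and the sample text are made up for illustration and are not part of Wherebot.

```python
def find_config_lines(source_text, marker="#CONFIG"):
    """Return (line_number, line) pairs for every line containing the marker."""
    hits = []
    for number, line in enumerate(source_text.splitlines(), start=1):
        if marker in line:
            hits.append((number, line.strip()))
    return hits


# Hypothetical sample; in practice, read cv-watch.pl, append.py and
# user-config.py from disk and pass their contents in.
sample = """mylang='en' #CONFIG: change for your wiki language
maxthrottle=2
usernames['wikipedia']['en']='Wherebot' #CONFIG: change for your wiki"""

for number, line in find_config_lines(sample):
    print(number, line)
```

Running it over each of the three files should list every spot that needs attention before the bot is pointed at another wiki.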

Please go into edit mode to see the source of the program with proper linebreaks.

Here is the main file, cv-watch.pl. Place it where you wish:

#!/usr/bin/perl
use strict;

#some of the IRC parts of this bot are based off of the Bot::BasicBot sample code

Wherebot->new(channels => ["#en.wikipedia", "#en.wikiversity"], nick=>"Wherebot4", server => "irc.wikimedia.org")->run(); #CONFIG: change Wherebot4 to something unique

package Wherebot;
use base qw/Bot::BasicBot/;
use IPC::Open2;

sub said {
   shift(); #don't care about the first parameter
   our %hash = %{shift()};

   our $rawMessage = $hash{"body"};
   our $channel = $hash{"channel"};
   our $site = $channel;
   $site =~ s&#&&;
   if ($rawMessage !~ m#02(http://\Q$site\E\.org[^ ]+)#) {return;} #no page URL in this message; also escapes the dots in the host name
   our $url = $1;
#CONFIG: the next four lines are to ignore certain pages. Customize if you like
   if ($url =~ /[Tt]alk:/) {return;}
   if ($url =~ /Sandbox/) {return;}
   if ($url =~ /Articles for deletion/) {return;}
   if ($url =~ /Wikipedia:Introduction/) {return;}
   chop $rawMessage;
   if ($rawMessage =~ /N\x{03}10/) {
#CONFIG: the next seven lines are to ignore new pages in certain namespaces. Customize if you like.
      if ($url =~ /User:/) {return;}
      if ($url =~ /Wikipedia:/) {return;}
      if ($url =~ /Portal:/) {return;}
      if ($url =~ /Help:/) {return;}
      if ($url =~ /Template:/) {return;}
      if ($url =~ /Category:/) {return;}
      if ($url =~ /Image:/) {return;}

      &act($channel, $url);
   }
}


 sub URLDecode { #From http://glennf.com/writing/hexadecimal.url.encoding.html
    my $theURL = $_[0];
    $theURL =~ tr/+/ /;
    $theURL =~ s/%([a-fA-F0-9]{2,2})/chr(hex($1))/eg;
    $theURL =~ s/<!--(.|\n)*-->//g;
   return $theURL;
 }




sub act {
   our $misc = "/home/where/misc";
   our $channel = shift;
   our $url = shift;
   $url =~ s#'##g; #just in case, although this would never be necessary
   chop $url;
   our $term = `wget '$url?action=raw' -q -O - | head -n 1`;
   chomp $term;

   our $origUrl = $url;
   $url =~ m#/wiki/(.*)#;
   our $page = $1;
   $url .= "?action=raw";
   $url =~ s#'##g; #shouldn't be a problem, but hey, I'm paranoid
   chomp $term;
   $term = &trim($term); #get it to <100 words so yahoo doesn't go crazy
   if ($term =~ /#redirect/i) {
      return;
   }
   if ($term =~ /^\{/) {
      return;
   }
   if ($term =~ /^</) {
      return;
   }

   $term =~ s#'''##g;
   $term =~ s#''##g;
   $term =~ s#\[\[##g;
   $term =~ s#\]\]##g;
   $term =~ s#\*##g;
   $term =~ s#"##g; #Yahoo chokes on quotes; yes, this will probably return false matches, but it is better than the alternative
   $term =~ s#\(##g;
   $term =~ s#\)##g;
#   if (m#([^\(\)]+)[\(\)]#) { #same thing with parenthesis
#      $term = $1;
#   }

   if (length($term) < 75) {
      return;
   }

   our $firstLine;
   our $n=0;
   while (1) {
      our $pid = open2(*Reader, *Writer, "python", "$misc/search.py", "-t", "web", '"' . $term . '"'); #CONFIG: change $misc/search.py to the path to search.py from the Yahoo search API
      $firstLine = <Reader>;
     # print "($url): FL: $firstLine\n";
      if ($firstLine =~ /Internal WebService error, temporarily unavailable/ || $firstLine =~ /^Got an error/) {
        warn "Search failed; retrying\n";
        sleep 60;
        waitpid $pid, 0;
        ++$n;
        if ($n < 3) {
           next;
        }
        else {
           last;
        }
      }
      else {
        waitpid $pid, 0;
        last;
      }
   }

   if (!($firstLine =~ /^No results\s*/)) {
      <Reader>;<Reader>; #skip some lines
      our $from = <Reader>;
      $from =~ s#\s##g;
      if ($from =~ m#^http://en\.wikipedia\.org# || $from =~ m#\.gov# || $from =~ m#^http://en\.wikibooks#) {
        return;
      }

      #Get the page in the proper format
      $page = &URLDecode($page);
      $page =~ s#_# #g;

      our $strippedUrl = $from;
      $strippedUrl =~ s#^http://##;
      #print "($page) copyvio from $from\n";

      if ($channel eq "#en.wikipedia") { #CONFIG: change this line according to your language and version
        chdir "$misc/pywikipedia"; #CONFIG: change this line according to where your pywikipedia directory is
      }
      print "Writing\n";
      open APPEND_PY, "|nice -n 10 python append.py" or do { warn "Could not start append.py: $!\n"; return; };
      print APPEND_PY  "* [[$page]] -- [$from $strippedUrl]. Reported at ~~~~~";
      close APPEND_PY;
   }
}

sub trim { #cut parameter to <100 words
   our $in = shift;
   our @in = split / /, $in;
   our $out = "";
   our $i = 1;
   for (@in) {
      $out .= $_ . " ";
      ++$i;
      if ($i == 99) {
        last;
      }
   }
   chop $out; #get rid of last space
   return $out;
}
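
For readers who find the Perl hard to follow, here is a rough Python 3 sketch of how act and trim turn the first line of a new page into a search phrase. It is an illustration only, not part of the bot: the names build_search_term, MIN_TERM_LENGTH and MAX_WORDS are invented here, and whitespace handling may differ slightly from Perl's split.

```python
import re

MIN_TERM_LENGTH = 75  # act gives up on first lines shorter than this
MAX_WORDS = 98        # the loop in trim stops once its counter reaches 99


def trim(text, max_words=MAX_WORDS):
    """Keep at most max_words space-separated words."""
    return " ".join(text.split(" ")[:max_words])


def build_search_term(first_line):
    """Return a cleaned search phrase, or None if the page should be skipped."""
    term = trim(first_line.rstrip("\n"))
    # Skip redirects and lines that start with a template, table or HTML tag.
    if re.search(r"#redirect", term, re.IGNORECASE) or term.startswith(("{", "<")):
        return None
    # Strip wiki markup plus the quotes and parentheses the search choked on.
    for token in ("'''", "''", "[[", "]]", "*", '"', "(", ")"):
        term = term.replace(token, "")
    # Very short phrases are not worth searching for.
    if len(term) < MIN_TERM_LENGTH:
        return None
    return term
```

The order matters: the skip checks run on the trimmed but still unstripped line, so a line beginning with a template is rejected before any markup is removed.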

The following file, append.py, should go in the pywikipediabot directory.

#!/usr/bin/python

import wikipedia
import sys

site = wikipedia.getSite()
page = wikipedia.Page(site, "User:Where/Sandbox") #CONFIG: Change page
text = page.get()
text = unicode(text + "\n") + unicode(raw_input(), 'utf8')
wikipedia.setAction("Adding a suspected copyright violation") #CONFIG: change edit summary
page.put(text,minorEdit=False)
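
append.py reads a single line from standard input and appends it to the report page. As a rough illustration of what that line looks like, here is a Python 3 sketch of the formatting that cv-watch.pl does before piping it in. The function name format_report and the example URL are invented for illustration. Note also that unquote_plus decodes percent-escapes as UTF-8, which is not what the Perl URLDecode sub does, and that the sketch omits URLDecode's removal of HTML comments.

```python
from urllib.parse import unquote_plus


def format_report(page_path, source_url):
    """Build the wikitext list item that is piped into append.py."""
    # Turn the URL form of the title back into a readable page name.
    title = unquote_plus(page_path).replace("_", " ")
    # Show the external link without its leading "http://".
    if source_url.startswith("http://"):
        stripped = source_url[len("http://"):]
    else:
        stripped = source_url
    return "* [[{}]] -- [{} {}]. Reported at ~~~~~".format(title, source_url, stripped)


print(format_report("Some_Page_%28band%29", "http://example.com/a"))
```

The five tildes at the end are wiki markup; the wiki replaces them with a timestamp when the page is saved.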

You need a user-config.py file in the pywikipediabot dir. Here's mine:

mylang='en' #CONFIG: change for your wiki language
usernames['wikipedia']['en']='Wherebot' #CONFIG: change for your wiki, wiki language and username

maxthrottle=2
put_throttle=3

Now run login.py in the pywikipediabot dir.

Finally, run cv-watch.pl.
