Virgil Griffith, a Disruptive Technologist at the California Institute of Technology, got me interested in one aspect of data mining on the internet: the search for the sources of the information we find there. As scientists we spend our whole careers trying to establish this crucial bit of source reference: who wrote about a certain topic, and whether it has been written about before. The internet is a kind of basket that contains all types of information, and everything can be found in it. It does not discriminate on originality or on how genuine the information is; through its search engines it discriminates only on popularity.
Virgil Griffith made a program called WikiScanner, which can be found on the internet at http://wikiscanner.virgil.gr/.
For those of you who do not know Virgil Griffith, here is a short account of him. I may be wrong or leave out things that belong here; it is only meant as a short introduction to the person and nothing else.
Virgil Griffith (born 1983 in Alabama, USA), also known as Romanpoet, is a hacker known for his involvement in a 2003 lawsuit with Blackboard Inc. and for his creation of WikiScanner. He has also published papers on artificial life. He is a visiting researcher at the Santa Fe Institute and at present a graduate student in the Computation and Neural Systems department at Caltech.
Qualification of internet information
Qualifying all the information on the internet is a challenge, to say the least. Some would call it impossible to check the information and to establish who originated it. To get at least an idea of where Wikipedia gets its information from, you can use WikiScanner. Some surprises may pop up as you use the tool, and some results may confirm ideas you already had about where information comes from. Maybe this gives you a guideline to where information on the internet comes from; I do not know.
WikiScanner results
Griffith found that 34.5 million edits were made anonymously, which is around 21% of all contributions to Wikipedia. The database holds more than 2,668,095 different organizations, and he found 187,529 different organizations with at least one edit. There were some other interesting findings, such as:
The percentage of anonymous edits differs by country.
The CIA does in fact edit Wikipedia.
A FOIA lawsuit was filed over Mike Huckabee white-washing.
A Dutch princess white-washed connections to a drug baron.
Politicians and corporations do in fact hire staff to police their pages.
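To make the idea behind these numbers a little more concrete: WikiScanner is generally described as matching the IP addresses logged for anonymous Wikipedia edits against a database of IP ranges registered to organizations. The sketch below only illustrates that matching step and is not Griffith's code. The organization names, the sample edits, and the function names org_for_ip and count_edits_by_org are all made up for this example, and the IP ranges are reserved documentation networks.

```python
import ipaddress
from collections import Counter

# Hypothetical IP-range-to-organization table. The ranges are reserved
# documentation networks and the organization names are invented.
ORG_RANGES = [
    (ipaddress.ip_network("192.0.2.0/24"), "Example Agency"),
    (ipaddress.ip_network("198.51.100.0/24"), "Example Corp"),
]


def org_for_ip(ip_text):
    """Return the organization whose range contains the address, or None."""
    address = ipaddress.ip_address(ip_text)
    for network, org in ORG_RANGES:
        if address in network:
            return org
    return None


def count_edits_by_org(edits):
    """Count anonymous edits per organization.

    `edits` is an iterable of (ip_text, page_title) pairs; an anonymous
    Wikipedia edit is logged under the editor's IP address.
    """
    counts = Counter()
    for ip_text, _page in edits:
        org = org_for_ip(ip_text)
        if org is not None:
            counts[org] += 1
    return counts


# Made-up sample edits, for illustration only.
SAMPLE_EDITS = [
    ("192.0.2.17", "Page A"),
    ("192.0.2.200", "Page B"),
    ("198.51.100.5", "Page A"),
    ("203.0.113.9", "Page C"),  # falls outside every known range
]

if __name__ == "__main__":
    for org, n in count_edits_by_org(SAMPLE_EDITS).most_common():
        print(org, n)
```

A linear scan over the ranges is fine for a handful of entries; with millions of organizations, as in the figures above, a real tool would presumably need a sorted or indexed lookup instead.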
Follow-Ups
If you are interested in doing data mining yourself, why not try WikiScanner and, in addition, use some of these toolboxes:
The General Architecture for Text Engineering, called GATE; http://gate.ac.uk/
MySQL / Python / Ruby
Text similarity tools; http://freshmeat.net/projects/levenshtein/
Data-mining tools; http://wiki2.issuecrawler.net/twiki/bin/view/Dmi/DmiTools
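As a small taste of what a text similarity tool does, here is a minimal plain-Python sketch of the Levenshtein edit distance, the measure the text similarity link is named after. It is written from the textbook definition and is not the API of that project; the function names levenshtein and similarity are chosen just for this example.

```python
def levenshtein(a, b):
    """Return the edit distance between strings a and b.

    The distance is the minimum number of single-character insertions,
    deletions, and substitutions needed to turn one string into the other.
    """
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            insert_cost = current[j - 1] + 1
            delete_cost = previous[j] + 1
            substitute_cost = previous[j - 1] + (char_a != char_b)
            current.append(min(insert_cost, delete_cost, substitute_cost))
        previous = current
    return previous[-1]


def similarity(a, b):
    """Scale the distance to the range 0.0 to 1.0 (1.0 means identical)."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a, b) / longest


if __name__ == "__main__":
    print(levenshtein("kitten", "sitting"))
    print(similarity("wikiscanner", "wiki scanner"))
```

Comparing two versions of the same sentence this way gives a rough indication of how much one was edited to produce the other, which is one way to start tracing where a piece of text came from.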