Just in case you don't know, I am fascinated by data. Storing it, searching or sorting it, and, most importantly, generating random representations of it. I have built all kinds of random generators, from simple integers to entire WordPress databases. While the usefulness of such tools can be debated, I will continue to research and build such things.
This subject takes us back to October 2013. I was having dinner with the Modulus crew when someone threw out a comment about using Wikipedia to create sensible random strings like "BlueTacoCar". Naturally my wheels started turning and I was inspired. After all, it should be easy enough to grab some random article and parse its content into some strings.
So I spent a night and built out the concept. I did some research, played around with some Node modules, and when I was finished I called it Wikirand. Of course I am going to fill you in on all the juicy details.
The basic idea is to end up with a string like BlueTacoCar, but with all the text coming from Wikipedia. Using stored data to build random sequences is actually fairly common. It's one of the simplest ways to get data that reads sensibly, since generating text character by character takes a lot of work before it sounds intelligent. Thankfully Wikipedia has a really robust API, so getting the data was the least of my worries. For those interested, the request URL ended up being (as code for formatting purposes):
var wikipediaURL = 'http://en.wikipedia.org/w/api.php?action=query&prop=revisions&format=json&rvprop=content&rvlimit=1&rvparse=&generator=random&grnnamespace=0&grnlimit=1';
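To give a feel for what comes back, here is a sketch of pulling the parsed HTML out of the response. The shape (query.pages, keyed by page id, with the content under revisions[0]['*']) is the classic MediaWiki JSON format this URL requests; the sample page object itself is invented for illustration.

```javascript
// Hypothetical helper: pull the parsed article HTML out of the API response.
function extractArticleHtml(apiResponse) {
  var pages = apiResponse.query.pages;
  // generator=random with grnlimit=1 returns exactly one page,
  // keyed by a page id we cannot know ahead of time.
  var pageId = Object.keys(pages)[0];
  return pages[pageId].revisions[0]['*'];
}

// Invented sample response, trimmed to the fields we touch:
var sample = {
  query: {
    pages: {
      '12345': {
        pageid: 12345,
        title: 'Taco',
        revisions: [{ '*': '<p>A <b>taco</b> is a dish.</p>' }]
      }
    }
  }
};

console.log(extractArticleHtml(sample)); // "<p>A <b>taco</b> is a dish.</p>"
```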
Initial Processing
What we are after here is words. Not HTML, paragraphs, or sentences, but just words. Since the article arrives pretty much as it appears on Wikipedia, there is a lot of stripping and splitting to be had. This can be done with our good friends, regular expressions.
Utilizing regex, stripping HTML tags is easy. Keep in mind we are stripping just the tags, not the content within. We want to keep those juicy words. I also ran into a few pieces that were encased in []s (lists, I believe), so those have to be stripped as well.
[
/(<([^>]+)>)/ig, //HTML tags
/(\[([^\]]+)\])/g //Things inside []s
]
Once all manner of tags are stripped, we remove everything that is not an alphabetic character or whitespace. Yep, you read right, everything that is not A through Z goes. This prevents any sort of special character from popping up in our final output, which is not what we want. The whitespace stays, for now, so we still have something to split on.
[
/(<([^>]+)>)/ig, //HTML tags
/(\[([^\]]+)\])/g, //Things inside []s
/[^A-Za-z\s]/g //Non-alphabetic characters
]
All of this processing leaves stray runs of whitespace, so the last step here is to collapse them down.
[
/(<([^>]+)>)/ig, //HTML tags
/(\[([^\]]+)\])/g, //Things inside []s
/[^A-Za-z\s]/g, //Non-alphabetic characters
/\s{2,}/g //Extra Space
]
Combined, these regular expressions remove everything but the words. This could be a good place to stop, split everything up, and start slamming words together. However, I like a little more choice in my words...
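The stripping-and-splitting step can be sketched as one small function: apply the four expressions from above in order, then split on spaces. The helper name and sample input are my own, not from the project.

```javascript
// The four filters from above, applied in order.
var filters = [
  /(<([^>]+)>)/ig,   // HTML tags
  /(\[([^\]]+)\])/g, // Things inside []s
  /[^A-Za-z\s]/g,    // Non-alphabetic characters
  /\s{2,}/g          // Extra space
];

function toWords(html) {
  var text = filters.reduce(function (str, re) {
    // Runs of whitespace collapse to a single space; everything else is dropped.
    return str.replace(re, re === filters[3] ? ' ' : '');
  }, html);
  return text.trim().split(' ');
}

console.log(toWords('<p>A <b>taco</b> [1] is a dish, yes.</p>'));
// [ 'A', 'taco', 'is', 'a', 'dish', 'yes' ]
```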
Parts of Speech?
Observant readers may have noticed that our final goal is really a collection of words chosen by what part of speech each word is. In the example "BlueTacoCar" you have an adjective followed by two nouns. It would make things much cooler if you could just specify a POS for each word and build a string that way.
The brute force method here would be a dictionary of words with a defined POS listed for each word. Dictionaries are pretty small, so it is a viable option...but not very flexible. After a little research I came across Brill tagging, an algorithm for determining a word's POS.
Being not too good at "the maths," I am very thankful this esoteric concept already had a Node.js module. Considering the module is a port of a Java project, which itself is based on algorithms from a PhD thesis, finding it was a very pleasant surprise.
Anyway, now the entire content of the article can be broken up into words, grouped by their POS. So someone can request "ANN" (adjective, noun, noun) and get something along the lines of "BlueTacoCar." Sweet.
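As a sketch of that last step: the real project gets its tags from the Brill tagger module, but a tiny hand-written tag lookup (entirely invented here) is enough to show the grouping and pattern-building idea.

```javascript
// Stand-in for the tagger: an invented word -> tag lookup ("A" = adjective,
// "N" = noun). The real project asks a Brill tagger module instead.
var fakeTags = {
  blue: 'A', red: 'A',
  taco: 'N', car: 'N', dish: 'N'
};

// Group the article's words by part of speech.
function groupByPos(words) {
  var groups = {};
  words.forEach(function (word) {
    var tag = fakeTags[word.toLowerCase()];
    if (!tag) return; // skip words the tagger cannot place
    (groups[tag] = groups[tag] || []).push(word);
  });
  return groups;
}

// Build a string from a pattern like "ANN": one random word per letter.
function build(pattern, groups) {
  return pattern.split('').map(function (tag) {
    var pool = groups[tag];
    return pool[Math.floor(Math.random() * pool.length)];
  }).join('');
}

var groups = groupByPos(['Blue', 'Taco', 'Car', 'dish', 'red']);
console.log(build('ANN', groups)); // e.g. "BlueTacoCar"
```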
The Other Stuff
As ridiculous as this solution is, I still wanted to take a few extra minutes and add some improvements. For one, there is a cache, so you are not always requesting a new article. After all, it can take a few seconds to fetch and parse a large article, and there is no real way to know which article you will get.
Just a few options for "random"
There is also duplicate removal built in. Not that "TacoTacoTaco" isn't funny sometimes; it is just not very random when you get it 10 times in a row from an article about tacos. Used words are also removed from the cache, again to avoid duplicate generation.
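Both halves of that can be sketched in a few lines: de-duplicate the pool up front, then splice each chosen word out so it cannot come back. The helper names here are my own, not the project's.

```javascript
// Drop duplicate entries up front, so an article about tacos does not
// fill the pool with the same word over and over.
function unique(words) {
  return words.filter(function (word, i) {
    return words.indexOf(word) === i;
  });
}

// A picked word is spliced out of the cached pool so it cannot be
// generated again for this article.
function takeRandomWord(pool) {
  var index = Math.floor(Math.random() * pool.length);
  return pool.splice(index, 1)[0];
}

var pool = unique(['Taco', 'Taco', 'Taco', 'Car']);
var word = takeRandomWord(pool);
console.log(word, pool); // the chosen word is no longer in the pool
```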
So there you have it, Wikirand in a nutshell...or at least a primer on how it works. I would encourage everyone to check it out and get inspired to try their own hands at creating awesome ways to generate random data. After all, it is fun and a great way to test your various data-driven projects.