Wikirand

Sun, 16 Feb 2014 15:58:32 -0700

Just in case you don't know, I am fascinated by data. Storing it, searching or sorting it, and, most importantly, generating random representations of it. I have built all kinds of random generators, from simple integers to entire WordPress databases. While the usefulness of such tools can be debated, I will continue to research and build such things.

This subject takes us back to October 2013. I was having dinner with the Modulus crew when someone threw out a comment about using Wikipedia to create sensible random strings like "BlueTacoCar". Naturally my wheels started turning and I was inspired. After all, it should be easy enough to grab some random article and parse its content into some strings.

So I spent a night and built out the concept. I did some research, played around with some Node modules and when finished I called it Wikirand. Of course I am going to fill you in on all the juicy details.

The basic idea is to end up with a string like BlueTacoCar, but with all the text coming from Wikipedia. Using stored data to build random sequences like this is actually fairly common. It's one of the few ways to get sensible-looking data, as generating text character by character requires a lot of work to make sure it sounds intelligent. Thankfully Wikipedia has a really robust API, so getting the data was the least of my worries. For those interested, the request URL ended up being (as code for formatting purposes):

var wikipediaURL = 'http://en.wikipedia.org/w/api.php?action=query&prop=revisions&format=json&rvprop=content&rvlimit=1&rvparse=&generator=random&grnnamespace=0&grnlimit=1';
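To give a feel for what comes back, here is a minimal sketch of pulling the article out of the response. It assumes the legacy JSON layout, where pages are keyed by page id and the revision content sits under a "*" key. The function name extractArticle and the sample object are made up for illustration, and this is not necessarily how Wikirand itself does it.

```javascript
// Assumed response shape (legacy format=json layout):
// { query: { pages: { "<pageid>": { title, revisions: [ { "*": "<html>" } ] } } } }
// extractArticle is a hypothetical helper, not part of any library.
function extractArticle(response) {
  const pages = (response && response.query && response.query.pages) || {};
  const ids = Object.keys(pages);
  if (ids.length === 0) return null; // nothing usable came back

  // grnlimit=1 means only one random page is expected
  const page = pages[ids[0]];
  const revision = page.revisions && page.revisions[0];
  if (!revision || typeof revision['*'] !== 'string') return null;

  return { title: page.title, html: revision['*'] };
}

// Hand-written sample standing in for a real API response
const sampleResponse = {
  query: {
    pages: {
      '42': { pageid: 42, ns: 0, title: 'Taco', revisions: [{ '*': '<p>A taco is food.</p>' }] }
    }
  }
};

const article = extractArticle(sampleResponse);
console.log(article.title, article.html);
```

Returning null instead of throwing keeps the caller simple: if the response is not usable, just request another random article.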

Initial Processing

What we are after here is words. Not HTML, paragraphs, or sentences, but just words. Since the article pretty much comes as it appears on Wikipedia, there is a lot of stripping and splitting to be had. This can be done with our good friends, regular expressions.

Utilizing regex, stripping HTML tags is easy. Keep in mind we are stripping just the tags, not the content within. We want to keep those juicy words. I also ran into a few pieces that were encased in []s (lists I believe), so those have to be stripped as well.

[
  /(<([^>]+)>)/ig, //HTML tags
  /(\[([^\]]+)\])/g //Things inside []s
]

Once all manner of tags are stripped, we remove everything that is not an alphabetic character. Yep, you read right, everything that is not A through Z (or whitespace) is removed. This prevents any sort of special character from popping up in our final output.

[
  /(<([^>]+)>)/ig, //HTML tags
  /(\[([^\]]+)\])/g, //Things inside []s
  /[^A-Za-z\s]/g //Non-alphabetic characters
]

All of this processing leaves some extra space, so the last step here is to remove the extra space.

[
  /(<([^>]+)>)/ig, //HTML tags
  /(\[([^\]]+)\])/g, //Things inside []s
  /[^A-Za-z\s]/g, //Non-alphabetic characters
  /\s{2,}/g //Extra Space
]
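Applied in order, the four patterns above might be wired together something like this. It is only a sketch: the replacement strings (a space for tags and brackets, nothing for stray characters) and the names cleanArticle and toWords are my assumptions for illustration.

```javascript
// Each step pairs one of the patterns above with an assumed replacement.
// Tags and bracketed bits become a space so neighbouring words don't fuse.
const steps = [
  [/(<([^>]+)>)/ig, ' '],   // HTML tags
  [/(\[([^\]]+)\])/g, ' '], // Things inside []s
  [/[^A-Za-z\s]/g, ''],     // Non-alphabetic characters
  [/\s{2,}/g, ' ']          // Extra space
];

// Run every replacement in sequence, then trim the ends
function cleanArticle(html) {
  return steps
    .reduce((text, [pattern, replacement]) => text.replace(pattern, replacement), html)
    .trim();
}

// Split the cleaned text into individual words
function toWords(html) {
  const text = cleanArticle(html);
  return text === '' ? [] : text.split(/\s+/);
}

console.log(toWords('<p>Blue <b>Taco</b> Car [1] costs $5!</p>'));
// → [ 'Blue', 'Taco', 'Car', 'costs' ]
```

One thing this does not handle is HTML entities: something like &amp;amp; would leave "amp" behind as a word, so a real version would probably want to decode or strip those first.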

Combined, these regular expressions remove everything but the text. This could be a good place to stop, split everything up, and start slamming words together. However, I like a little more choice in my words...

Parts of Speech?

Observant readers may have noticed that our final goal is really a collection of words chosen by part of speech (POS). In the example "BlueTacoCar" you have an adjective followed by two nouns. It would make things much cooler if you could just specify a POS for each word and build a string that way.

Blue Taco Car?

The brute force method here would be a dictionary of words with a defined POS listed for each word. Dictionaries are pretty small, so it is a viable option...but not very flexible. After a little research I came across the concept of Brill tagging, an algorithm for determining a word's POS.

Not being too good at "the maths," I am very thankful this esoteric concept already had a Node.js module. Considering this module is a port of a Java project, which itself is based on algorithms from a PhD thesis, finding it was extremely surprising.

Anyway, now the entire content of the article can be broken up into words, grouped by their POS. So someone can request "ANN" (adjective, noun, noun) and get something along the lines of "BlueTacoCar." Sweet.
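Here is a rough sketch of the grouping and string building. It leaves the tagger itself out and assumes it hands back [word, tag] pairs with Penn Treebank style tags (JJ for adjectives, NN for nouns, VB for verbs). The letter codes, the function names, and the injectable random function are all assumptions for illustration, not Wikirand's actual code.

```javascript
// Hypothetical mapping from request letters to tag prefixes
const CODE_TO_TAG_PREFIX = { A: 'JJ', N: 'NN', V: 'VB' };

// Bucket tagged words by request letter, e.g. { A: [...], N: [...] }
function groupByPos(taggedWords) {
  const groups = {};
  for (const [word, tag] of taggedWords) {
    for (const [code, prefix] of Object.entries(CODE_TO_TAG_PREFIX)) {
      if (tag.startsWith(prefix)) {
        (groups[code] = groups[code] || []).push(word);
      }
    }
  }
  return groups;
}

function capitalize(word) {
  return word.charAt(0).toUpperCase() + word.slice(1).toLowerCase();
}

// Build a string such as "BlueTacoCar" from a pattern such as "ANN".
// `random` can be swapped out to make the result repeatable.
function buildString(pattern, groups, random = Math.random) {
  return pattern.split('').map((code) => {
    const words = groups[code];
    if (!words || words.length === 0) {
      throw new Error('No words available for "' + code + '"');
    }
    return capitalize(words[Math.floor(random() * words.length)]);
  }).join('');
}

// Hand-tagged sample standing in for real tagger output
const tagged = [['blue', 'JJ'], ['taco', 'NN'], ['car', 'NN'], ['eats', 'VBZ']];
console.log(buildString('ANN', groupByPos(tagged)));
```

Matching on tag prefixes means plural and proper nouns (NNS, NNP) land in the noun bucket too, which seemed like the friendlier choice for a toy like this.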

The Other Stuff

As ridiculous as this solution is, I still wanted to take a few extra minutes and add some improvements. For one, there is a cache, so you are not always requesting a new article. After all, it could take a few seconds to get and parse a large article and there is no real way to know what article you will get.

Just a few options for "random"

There is also duplicate removal built in. Not that "TacoTacoTaco" isn't funny sometimes, it is just not very random when you get that 10 times in a row from an article about tacos. Used words are also removed from the cache, again, to avoid duplicate generation.
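The cache and duplicate handling could look roughly like this. Consider it a sketch under my own assumptions: WordPool and its methods are hypothetical names, words are compared case-insensitively, and a word that has been seen once is never accepted again.

```javascript
// Hypothetical cache of words grouped by POS letter.
// add() ignores duplicates; take() removes the word it returns.
class WordPool {
  constructor(random = Math.random) {
    this.random = random;
    this.byPos = new Map(); // letter -> array of unused words
    this.seen = new Set();  // "letter:word" keys that were ever added
  }

  // Returns true if the word was new, false if it was a duplicate
  add(code, word) {
    const key = code + ':' + word.toLowerCase();
    if (this.seen.has(key)) return false;
    this.seen.add(key);
    if (!this.byPos.has(code)) this.byPos.set(code, []);
    this.byPos.get(code).push(word);
    return true;
  }

  // Removes and returns a random word, or null when the pool is empty
  take(code) {
    const words = this.byPos.get(code) || [];
    if (words.length === 0) return null; // time to fetch another article
    const index = Math.floor(this.random() * words.length);
    return words.splice(index, 1)[0];
  }
}

const pool = new WordPool();
['Taco', 'taco', 'Car'].forEach((word) => pool.add('N', word));
console.log(pool.take('N'), pool.take('N'), pool.take('N'));
```

Since only two distinct nouns went in, the third take comes back null, which is the signal to go grab a fresh article.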

So there you have it, Wikirand in a nutshell...or at least a primer on how it works. I would encourage everyone to check it out and get inspired to try their own hands at creating awesome ways to generate random data. After all, it is fun and a great way to test your various data-driven projects.

Why Blog?

Wed, 05 Feb 2014 15:29:00 -0700

It's 11pm and I have finally set up my blog. It's not a new experience. I have set up blogs before and I have certainly tried my hand at maintaining a content flow. The problem for me has always been trying to force it, over-thinking what I should write about or over-editing what content I do come up with. This is only amplified when looking into the eyes of an empty blog and wondering "what do I put on here first?"

I am sure it's a proverbial dilemma. Everything is in place and ready to go, you just need to get started. In physics it's called static friction, the force to overcome in order for a static object to start moving. Here, however, there are no calculations or fancy formulas for what to scribe on the page.

Then I began thinking. Why do I even want to blog? I know it's something a lot of people do and last I checked I was at least 50% human. I also am a firm believer in the web and freely sharing knowledge with those willing to learn. It seems like a blog is the perfect outlet for such activities. But the reasons run deeper than that, and I would like to start the sharing with some thoughts.

A year and a half ago I helped found Modulus. Even though that is honestly not that long ago in the scope of a lifetime, I was very different back then in some ways. Modulus certainly accelerated my skills in web development way beyond a typical software job. I began understanding how to network with people. Of course I also found out business development is not the sort of development I am good at.

What a lot of people miss is the realization to just try. You would think with all the movies with the "you can do anything you put your mind to" message it would be more obvious...I guess not. All you have to do is figure out what you want to do and try it. That is the core concept I learned from my experience so far. Don't play it safe all the time and don't worry about failure or rejection.

This brings us back to the original question: why blog? I want to blog as an outlet. Something others might find useful or entertaining. A break from various work and frustrations of life. Most of all it's something I want to challenge myself to try and build.

Richard Key
