<?xml version="1.0" encoding="UTF-8"?>
<rss version='2.0' xmlns:dc="http://purl.org/dc/elements/1.1/"
  xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom">
  <channel>
    <title>Richard Key</title>
    <description>live streamer, developer, lover of movies.</description>
    <link>https://busyrich.silvrback.com/feed</link>
    <atom:link href="https://busyrich.silvrback.com/feed" rel="self" type="application/rss+xml"/>
    <category domain="busyrich.silvrback.com">Content Management/Blog</category>
    <language>en-us</language>
      <pubDate>Sun, 16 Feb 2014 15:58:32 -0700</pubDate>
    <managingEditor>social@busyrich.com (Richard Key)</managingEditor>
      <item>
        <guid>http://busyrich.me/wikirand#1898</guid>
          <pubDate>Sun, 16 Feb 2014 15:58:32 -0700</pubDate>
        <link>http://busyrich.me/wikirand</link>
        <title>Wikirand</title>
        <description>Using Wikipedia for random data</description>
        <content:encoded><![CDATA[<p>Just in case you don&#39;t know, I am fascinated by data. Storing it, searching or sorting it, and, most importantly, generating random representations of it. I have built all kinds of random generators, from simple integers to entire <a href="https://github.com/BusyRich/Wordpress-Data-Generator">Wordpress databases</a>. While the usefulness of such tools can be debated, I will continue to research and build such things.</p>

<p>This subject takes us back to October 2013. I was having dinner with the Modulus crew when someone threw out a comment about using Wikipedia to create sensible random strings like &quot;BlueTacoCar&quot;. Naturally my wheels started turning and I was inspired. After all, it should be easy enough to grab some random article and parse its content into some strings.</p>

<p>So I spent a night and built out the concept. I did some research, played around with some Node modules, and when I finished I called it <a href="http://wiki-rand-9458.onmodulus.net/">Wikirand</a>. Of course I am going to fill you in on all the juicy details.</p>

<p>The basic idea is to end up with a string like BlueTacoCar, but with all the text coming from Wikipedia. This approach, using stored data to build random sequences, is actually fairly common. It&#39;s one of the few ways to get sensible-looking data, as generating text character by character requires a lot of work to make sure it sounds intelligent. Thankfully Wikipedia has a really <a href="http://en.wikipedia.org/wiki/Special:ApiSandbox">robust API</a>, so getting the data was the least of my worries. For those interested, the request URL ended up being (as code for formatting purposes):</p>
<div class="highlight"><pre><span></span><span class="kd">var</span> <span class="nx">wikipediaURL</span> <span class="o">=</span> <span class="s1">&#39;http://en.wikipedia.org/w/api.php?action=query&amp;prop=revisions&amp;format=json&amp;rvprop=content&amp;rvlimit=1&amp;rvparse=&amp;generator=random&amp;grnnamespace=0&amp;grnlimit=1&#39;</span><span class="p">;</span>
</pre></div>
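<p>To make that query string easier to read, here is a minimal sketch that rebuilds it from named parameters. The parameter values are copied from the URL above; the comments are only a best reading of what each one does, and the <code>params</code> variable is a name made up for this illustration, not something from the Wikirand source.</p>

```javascript
// Query parameters copied from the request URL above.
// The comments are an interpretation, not taken from the original project.
const params = new URLSearchParams({
  action: 'query',      // use the query module
  prop: 'revisions',    // ask for page revisions
  format: 'json',       // respond with JSON
  rvprop: 'content',    // include the revision text
  rvlimit: '1',         // one revision only
  rvparse: '',          // flag: return parsed HTML instead of wikitext
  generator: 'random',  // let the API pick the page at random
  grnnamespace: '0',    // main (article) namespace only
  grnlimit: '1'         // one random page
});

const wikipediaURL = 'http://en.wikipedia.org/w/api.php?' + params.toString();
console.log(wikipediaURL);
```

<p>Keeping the parameters in an object like this makes it simple to tweak one of them, such as the number of random pages, without editing a long string by hand.</p>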
<h2 id="initial-processing">Initial Processing</h2>

<p>What we are after here is words. Not HTML, paragraphs, or sentences, just words. Since the article pretty much comes as it appears on Wikipedia, there is a lot of stripping and splitting to do. This can be handled with our good friends, regular expressions.</p>

<p>Utilizing regex, stripping HTML tags is easy. Keep in mind we are stripping just the tags, not the content within. We want to keep those juicy words. I also ran into a few pieces that were encased in <code>[]</code>s (lists I believe), so those have to be stripped as well.</p>
<div class="highlight"><pre><span></span><span class="p">[</span>
  <span class="sr">/(&lt;([^&gt;]+)&gt;)/ig</span><span class="p">,</span> <span class="c1">//HTML tags</span>
  <span class="sr">/(\[([^\]]+)\])/g</span> <span class="c1">//Things inside []s</span>
<span class="p">]</span>
</pre></div>
<p>Once all manner of tags are stripped, we remove everything that is not an alphabetic character or whitespace. Yep, you read that right, everything that is not A through Z (in either case) or a space is removed. This prevents any sort of unwanted special character from popping up in our final output.</p>
<div class="highlight"><pre><span></span><span class="p">[</span>
  <span class="sr">/(&lt;([^&gt;]+)&gt;)/ig</span><span class="p">,</span> <span class="c1">//HTML tags</span>
  <span class="sr">/(\[([^\]]+)\])/g</span><span class="p">,</span> <span class="c1">//Things inside []s</span>
  <span class="sr">/[^A-Za-z\s]/g</span> <span class="c1">//Non-alphabetic characters</span>
<span class="p">]</span>
</pre></div>
<p>All of this processing leaves some extra space, so the last step here is to remove the extra space.</p>
<div class="highlight"><pre><span></span><span class="p">[</span>
  <span class="sr">/(&lt;([^&gt;]+)&gt;)/ig</span><span class="p">,</span> <span class="c1">//HTML tags</span>
  <span class="sr">/(\[([^\]]+)\])/g</span><span class="p">,</span> <span class="c1">//Things inside []s</span>
  <span class="sr">/[^A-Za-z\s]/g</span><span class="p">,</span> <span class="c1">//Non-alphabetic characters</span>
  <span class="sr">/\s{2,}/g</span> <span class="c1">//Extra Space</span>
<span class="p">]</span>
</pre></div>
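<p>As a rough sketch of how a list like this could be applied, the snippet below runs each pattern over the article text in order and then splits the result into words. The post does not say what each match is replaced with, so replacing with a single space is an assumption here, as are the names <code>extractWords</code> and <code>sample</code>.</p>

```javascript
// The four patterns from the list above, in the same order.
const patterns = [
  /(<([^>]+)>)/ig,    // HTML tags
  /(\[([^\]]+)\])/g,  // things inside []s
  /[^A-Za-z\s]/g,     // non-alphabetic characters
  /\s{2,}/g           // extra space
];

// Assumption: every match is replaced with one space so neighboring
// words never fuse together; the original replacement is not stated.
function extractWords(html) {
  const text = patterns.reduce((acc, re) => acc.replace(re, ' '), html);
  // Trim the edges, split on whitespace, and drop empty entries.
  return text.trim().split(/\s+/).filter(Boolean);
}

// Hypothetical input, made up for this example.
const sample = '<p>The <b>blue</b> taco[1] car, est. 1999!</p>';
console.log(extractWords(sample));
```

<p>Replacing with a space rather than an empty string matters: stripping a tag like the one between two adjacent words would otherwise glue them into a single token.</p>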
<p>Combined, these regular expressions remove everything but the text. This could be a good place to stop, split everything up, and start slamming words together. However, I like a little more choice in my words...</p>

<h2 id="parts-of-speech">Parts of Speech?</h2>

<p>Observant readers may have noticed that our final goal is really a string of words chosen according to each word&#39;s part of speech (POS). In the example &quot;BlueTacoCar&quot; you have an adjective followed by two nouns. It would make things much cooler if you could just specify a POS for each word and build a string that way.</p>

<p><a href="http://www.lataco.com/taco/king-taco-racing-car-show"> <img alt="Blue Taco Car Visual" src="http://www.lataco.com/taco/wp-content/uploads/king-taco-racecar-2.jpg" /></a><br>
<em>Blue Taco Car?</em></p>

<p>The brute force method here would be a dictionary of words with a defined POS listed for each word. Dictionaries are pretty small, so it is a viable option...but not very flexible. After a little research I came across the concept of <a href="http://en.wikipedia.org/wiki/Brill_tagger">Brill tagging</a>, an algorithm for determining a word&#39;s POS.</p>

<p>Not being too good at &quot;the maths&quot;, I am very thankful this esoteric concept already had a <a href="https://github.com/fortnightlabs/pos-js">Node.js module</a>. Considering this module is a port of a <a href="https://github.com/mark-watson/fasttag_v2">Java project</a>, which itself is based on algorithms from a <a href="http://repository.upenn.edu/ircs_reports/191/">PhD thesis</a>, finding it was extremely surprising.</p>

<p>Anyway, now the entire content of the article can be broken up into words, grouped by their POS. So someone can request &quot;ANN&quot; (adjective, noun, noun) and get something along the lines of &quot;BlueTacoCar.&quot; Sweet.</p>
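<p>The grouping and the &quot;ANN&quot; style request can be sketched roughly as follows. This is not the Wikirand source: it assumes the tagger hands back <code>[word, tag]</code> pairs with Penn Treebank style tags (JJ for adjectives, NN for nouns), and the sample data, the letter-to-tag mapping, and the function names are all made up for illustration.</p>

```javascript
// Hypothetical sample of tagger output: [word, tag] pairs using
// Penn Treebank style tags (JJ = adjective, NN = noun, VBZ = verb form).
const tagged = [
  ['blue', 'JJ'], ['taco', 'NN'], ['car', 'NN'], ['eats', 'VBZ']
];

// Letters a caller may request, mapped to the tag prefix they stand for.
// Only "A" and "N" appear in the post; the mapping itself is an assumption.
const letterToTag = { A: 'JJ', N: 'NN' };

// Group words by the request letter whose tag prefix they match.
function groupByLetter(pairs) {
  const groups = {};
  for (const [word, tag] of pairs) {
    for (const letter of Object.keys(letterToTag)) {
      if (tag.startsWith(letterToTag[letter])) {
        (groups[letter] = groups[letter] || []).push(word);
      }
    }
  }
  return groups;
}

// Build a string such as "BlueTacoCar" from a pattern such as "ANN".
// `random` is injectable so the sketch can be exercised deterministically.
function build(pattern, groups, random = Math.random) {
  return pattern.split('').map((letter) => {
    const words = groups[letter] || [];
    if (words.length === 0) throw new Error('No words for ' + letter);
    const word = words[Math.floor(random() * words.length)];
    return word[0].toUpperCase() + word.slice(1);
  }).join('');
}

const groups = groupByLetter(tagged);
console.log(build('ANN', groups));
```

<p>Matching on a tag prefix means plural or proper nouns (tags that also start with NN) land in the same bucket as plain nouns, which seems reasonable when all you want is a silly but readable string.</p>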

<h2 id="the-other-stuff">The Other Stuff</h2>

<p>As ridiculous as this solution is, I still wanted to take a few extra minutes and add some improvements. For one, there is a cache, so you are not always requesting a new article. After all, it could take a few seconds to get and parse a large article, and there is no real way to know which article you will get.</p>

<p><img alt="random articles, maybe?" src="https://silvrback.s3.amazonaws.com/uploads/e97d1289-e3e7-4938-9dbf-123602ee68fe/Screen%20Shot%202014-02-16%20at%203.42.34%20PM_large.png" /><br>
<em>Just a few options for &quot;random&quot;</em></p>

<p>Duplicate removal is built in as well. Not that &quot;TacoTacoTaco&quot; isn&#39;t funny sometimes, it is just not very random when you get that 10 times in a row from an article about tacos. Used words are also removed from the cache, again, to avoid duplicate generation.</p>
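<p>Here is one way the duplicate handling could look, as a minimal sketch rather than the actual implementation. It assumes the cached words for a part of speech live in a plain array; treating words as duplicates regardless of case is also an assumption, and <code>dedupe</code> and <code>takeWord</code> are hypothetical names.</p>

```javascript
// Remove repeated words, ignoring case, keeping the first occurrence.
function dedupe(words) {
  const seen = new Set();
  return words.filter((word) => {
    const key = word.toLowerCase();
    if (seen.has(key)) return false;
    seen.add(key);
    return true;
  });
}

// Pick one word at random and remove it from the cached list so it
// cannot be handed out again. Returns undefined once the cache is empty.
function takeWord(cache, random = Math.random) {
  if (cache.length === 0) return undefined;
  const index = Math.floor(random() * cache.length);
  return cache.splice(index, 1)[0];
}

const cache = dedupe(['taco', 'Taco', 'salsa', 'taco']);
console.log(takeWord(cache), takeWord(cache), takeWord(cache));
```

<p>Once a list runs dry, that would be a natural moment to fetch and parse a fresh article.</p>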

<p>So there you have it, <a href="https://git.geekli.st/BusyRich/wikipedia-random">Wikirand</a> in a nutshell...or at least a primer on how it works. I would encourage everyone to check it out and get inspired to try their hand at creating awesome ways to generate random data. After all, it is fun and a great way to test your various data-driven projects.</p>
]]></content:encoded>
      </item>
      <item>
        <guid>http://busyrich.me/why-blog#1795</guid>
          <pubDate>Wed, 05 Feb 2014 15:29:00 -0700</pubDate>
        <link>http://busyrich.me/why-blog</link>
        <title>Why Blog?</title>
        <description>or my first post</description>
        <content:encoded><![CDATA[<p>It&#39;s 11pm and I have finally set up my blog. It&#39;s not a new experience. I have set up blogs before and I have certainly tried my hand at maintaining a content flow. The problem for me has always been trying to force it, over-thinking what I should write about or over-editing what content I do come up with. This is only amplified when looking into the eyes of an empty blog and wondering &quot;what do I put on here first?&quot;</p>

<p>I am sure it&#39;s a proverbial dilemma. Everything is in place and ready to go, you just need to get started. In physics it&#39;s called <a href="http://en.wikipedia.org/wiki/Friction#Static_friction">Static Friction</a>, the force to overcome in order for a static object to start moving. Here, however, there are no calculations or fancy formulas for what to scribe on the page.</p>

<p>Then I began thinking. Why do I even want to blog? I know it&#39;s something a lot of people do and last I checked I was at least 50% human. I am also a firm believer in the web and freely sharing knowledge with those willing to learn. It seems like a blog is the perfect outlet for such activities. But the reasons run deeper than that, and I would like to start the sharing with some thoughts.</p>

<p>A year and a half ago I helped found <a href="https://modulus.io">Modulus</a>. Even though that is honestly not that long ago in the scope of a lifetime, I was very different in some ways. It certainly accelerated my skills in web development way beyond a typical software job. I began understanding how to network with people. Of course I also found out business development is not the sort of development I am good at.</p>

<p>What a lot of people miss is the realization that you can just try. You would think with all the movies with the &quot;you can do anything you put your mind to&quot; message it would be more obvious...I guess not. All you have to do is figure out what you want to do and try it. That is the core concept I learned from my experience so far. Don&#39;t play it safe all the time and don&#39;t worry about failure or rejection.</p>

<p>This brings us back to the original question: why blog? I want to blog as an outlet. Something others might find useful or entertaining. A break from various work and frustrations of life. Most of all, it&#39;s something I want to challenge myself to try and build.</p>
]]></content:encoded>
      </item>
  </channel>
</rss>