<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>MetaOptimize</title>
	<atom:link href="http://blog.metaoptimize.com/feed/" rel="self" type="application/rss+xml" />
	<link>http://blog.metaoptimize.com</link>
	<description>building machine learning and natural language processing tools</description>
	<lastBuildDate>Wed, 10 Mar 2010 17:17:29 +0000</lastBuildDate>
	<generator>http://wordpress.org/?v=2.9.2</generator>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
			<item>
		<title>Lean Startup, and The Stooges</title>
		<link>http://blog.metaoptimize.com/2010/03/10/lean-startup-and-the-stooges/</link>
		<comments>http://blog.metaoptimize.com/2010/03/10/lean-startup-and-the-stooges/#comments</comments>
		<pubDate>Wed, 10 Mar 2010 17:08:46 +0000</pubDate>
		<dc:creator>Joseph Turian</dc:creator>
				<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://blog.metaoptimize.com/?p=93</guid>
		<description><![CDATA[Okay, I&#8217;m ready.
After reading a handful of articles making tenuous connections between entrepreneurship and music, including:

The Notorious CEO: Ten Startup Commandments from Biggie Smalls
Being like The Sex Pistols can help your startup?

I&#8217;ve decided to come out and share my favorite startup music.
Dirt, by The Stooges, is a proto-punk cut that sprawls for seven minutes, brooding [...]]]></description>
			<content:encoded><![CDATA[<p>Okay, I&#8217;m ready.</p>
<p>After reading a handful of articles making tenuous connections between entrepreneurship and music, including:</p>
<ul>
<li><a href="http://themetricsystem.rjmetrics.com/2009/08/10/the-notorious-ceo-ten-startup-commandments-from-biggie-smalls/">The Notorious CEO: Ten Startup Commandments from Biggie Smalls</a></li>
<li><a href="http://blog.smartupz.com/2010/03/being-like-sex-pistols-can-help-your.html">Being like The Sex Pistols can help your startup?</a></li>
</ul>
<p>I&#8217;ve decided to come out and share my favorite startup music.</p>
<p>Dirt, by <a href="http://en.wikipedia.org/wiki/The_Stooges">The Stooges</a>, is a <a href="http://www.allmusic.com/cg/amg.dll?p=amg&#038;sql=77:2698">proto-punk</a> cut that sprawls for seven minutes, brooding and smoldering. It never climaxes or burns out; it just persists and drives forward.</p>
<p>Anyway, I believe this song should be the mantra for bootstrappers, in particular those who practice the <a href="http://www.startuplessonslearned.com/">lean</a> <a href="http://groups.google.com/group/lean-startup-circle?pli=1">startup</a> <a href="http://leanstartup.pbworks.com/">methodology</a>.</p>
<ul>
<i>Ooh, I been dirt / And I don&#8217;t care / Cause I&#8217;m burning inside / I&#8217;m just a yearning inside / And I&#8217;m the fire o&#8217; life.</i>
</ul>
<p>Without further ado, <b>DIRT</b>:</p>
<p><object width="425" height="344"><param name="movie" value="http://www.youtube.com/v/zxYXV2RrwIs&#038;hl=en_US&#038;fs=1&#038;"></param><param name="allowFullScreen" value="true"></param><param name="allowscriptaccess" value="always"></param><embed src="http://www.youtube.com/v/zxYXV2RrwIs&#038;hl=en_US&#038;fs=1&#038;" type="application/x-shockwave-flash" allowscriptaccess="always" allowfullscreen="true" width="425" height="344"></embed></object></p>
]]></content:encoded>
			<wfw:commentRss>http://blog.metaoptimize.com/2010/03/10/lean-startup-and-the-stooges/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Constitution for Governance of Open-Source Projects (v20100227)</title>
		<link>http://blog.metaoptimize.com/2010/02/27/constitution-for-governance-of-open-source-projects-v20100227/</link>
		<comments>http://blog.metaoptimize.com/2010/02/27/constitution-for-governance-of-open-source-projects-v20100227/#comments</comments>
		<pubDate>Sun, 28 Feb 2010 01:08:09 +0000</pubDate>
		<dc:creator>Joseph Turian</dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[Free software]]></category>
		<category><![CDATA[Governance]]></category>
		<category><![CDATA[Open source]]></category>

		<guid isPermaLink="false">http://blog.metaoptimize.com/?p=89</guid>
		<description><![CDATA[Summary
I propose a default &#8220;Constitution for Governance of Open-Source Projects&#8221;.

Background
I recently got involved in the OSQA project, which is a fork of CNPROG, which in turn is a clone of the StackExchange Q&#038;A forum software.
Note that the OSQA project has no formal &#8220;homepage&#8221;, or instructions on how to get involved. I only discovered by chance [...]]]></description>
			<content:encoded><![CDATA[<h1>Summary</h1>
<p>I propose a default &#8220;Constitution for Governance of Open-Source Projects&#8221;.</p>
<hr />
<h1>Background</h1>
<p>I recently got involved in the <a href="http://osqa.net/question/2/where-can-i-get-the-source-code-for-osqa">OSQA</a> project, which is a fork of <a href="http://github.com/cnprog/CNPROG">CNPROG</a>, which in turn is a clone of the <a href="http://stackexchange.com/">StackExchange</a> Q&#038;A forum software.</p>
<p>Note that the OSQA project has no formal &#8220;homepage&#8221;, or instructions on how to get involved. I only discovered by chance that there is a mailing list (unarchived) and a developer chat room. Nor was it immediately clear which OSQA GitHub fork one should use.</p>
<p>This is because OSQA grew organically from one contributor to a handful, and developer involvement was an afterthought in this project. Not that there is anything wrong with that.<br />
However, now that a handful of people are involved in the project, and <a href="http://osqa.net/questions/unanswered/">more people are trying to get involved</a>, we have begun discussing governance and decision-making policies on the mailing list. In fact,<br />
<a href="http://nmrwiki.org/">Evgeny Fadeev</a> poses this very question on <a href="http://stackoverflow.com/questions/2328631/how-to-achieve-effective-democratic-governance-for-an-open-source-project">StackOverflow</a>, and proposes some potential answers.</p>
<p>I believe that, by default, there are some simple but clear principles that should be enunciated. I hereby propose my</p>
<h1>Constitution for Governance of Open-Source Projects (v20100227)</h1>
<p>Let it be affirmed that the primary goal in instituting governance of an open-source project be to ensure the long-term health of the project.</p>
<p>Accordingly, the default bias should be towards openness and inclusiveness.<br />
However, policy should be changed as issues present themselves, in order to maintain the long-term health of the project.</p>
<p>For the model of decision making, we favor a &#8220;do-ocracy&#8221;.<br />
The people who contribute the most generally command the respect of the community.<br />
Alienating them is the best way to derail the project.</p>
<p>The repository should be open to committers, given that commits can easily be reverted and commit access easily revoked. This is preferable to alienating potential committers.</p>
<p>To ensure transparency for developers new and old, and to allow them to decide their involvement in a project based upon the history of the project, there should be transparency and openness in the inner workings of the project. For example, the email archive should be public.</p>
<p>Lastly, let us remember that too much red tape gets in the way of progress. So red tape and other barriers to contribution should be avoided, and only added as issues present themselves.</p>
<p>This Constitution can and should be amended as issues present themselves.</p>
<p>Therefore be it resolved.</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.metaoptimize.com/2010/02/27/constitution-for-governance-of-open-source-projects-v20100227/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>Why can&#8217;t you pickle generators in Python? A pattern for saving training state</title>
		<link>http://blog.metaoptimize.com/2009/12/22/why-cant-you-pickle-generators-in-python-workaround-pattern-for-saving-training-state/</link>
		<comments>http://blog.metaoptimize.com/2009/12/22/why-cant-you-pickle-generators-in-python-workaround-pattern-for-saving-training-state/#comments</comments>
		<pubDate>Tue, 22 Dec 2009 08:52:13 +0000</pubDate>
		<dc:creator>Joseph Turian</dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[experimental control]]></category>
		<category><![CDATA[Generator]]></category>
		<category><![CDATA[generators]]></category>
		<category><![CDATA[machine learning]]></category>
		<category><![CDATA[persistance]]></category>
		<category><![CDATA[pickling]]></category>
		<category><![CDATA[Python]]></category>
		<category><![CDATA[serialization]]></category>
		<category><![CDATA[training state]]></category>

		<guid isPermaLink="false">http://blog.metaoptimize.com/?p=72</guid>
		<description><![CDATA[Summary

A pattern for persisting generators is to turn them into pickle-able class objects. This is useful when you use generators for streaming training examples.
I would also try generator_tools, which might be a more convenient alternative to the pattern I describe. I haven&#8217;t used it yet.

Generators for streaming training examples
For machine learning, python generators are a [...]]]></description>
			<content:encoded><![CDATA[<h1>Summary</h1>
<p><a href="http://flickr.com/photos/28402283@N07/3186143355" title="Moon Rise behind the San Gorgonio Pass Wind Farm"><img align=right src="http://farm4.static.flickr.com/3118/3186143355_4840fb7620_t.jpg" /></a></p>
<p>A pattern for persisting generators is to turn them into pickle-able class objects. This is useful when you use generators for streaming training examples.</p>
<p>I would also try <a href="http://www.fiber-space.de/generator_tools/doc/generator_tools.html">generator_tools</a>, which might be a more convenient alternative to the pattern I describe. I haven&#8217;t used it yet.</p>
<hr />
<h2>Generators for streaming training examples</h2>
<p>For machine learning, Python <a href="http://www.ibm.com/developerworks/library/l-pycon.html">generators</a> are a simple idiom that makes it easy to generate a stream of training examples. Moreover, you can nest generators:</p>
<ul>
<li>The inner generator can be used to read one example at a time.</li>
<li>The outer generator can be used to read examples from the inner generator until you have a full minibatch, and then yield this minibatch.</li>
</ul>
<p>Here is some example code:</p>
<p>[Update: The example holds without the ALL CAPS magic variable names, "HYPERPARAMETERS". However, I include HYPERPARAMETERS because I am including the actual code I am using. Hyperparameters are global, read-only variables that specify the particular experimental condition being tested. I can't say that I have the best solution to this particular aspect of experimental control (hyperparameters). I might write a blog post about it in the future, to solicit feedback on improved methods. However, I have refined my current approach over several years, and I can assure you that it is far less painful than a handful of more "clean" approaches.]</p>
<pre>import common.hyperparameters
# myopen is the author's file-opening helper; it is assumed to be in scope.

def get_train_example():
    """Inner generator: yield one window of word ids at a time."""
    HYPERPARAMETERS = common.hyperparameters.read("language-model")

    from vocabulary import wordmap
    for l in myopen(HYPERPARAMETERS["TRAIN_SENTENCES"]):
        prevwords = []
        # split() with no arguments already strips surrounding whitespace.
        for w in l.split():
            if wordmap.exists(w):
                prevwords.append(wordmap.id(w))
                if len(prevwords) >= HYPERPARAMETERS["WINDOW_SIZE"]:
                    yield prevwords[-HYPERPARAMETERS["WINDOW_SIZE"]:]
            else:
                # Unknown word: start a fresh window.
                prevwords = []

def get_train_minibatch():
    """Outer generator: collect examples until a minibatch is full, then yield it."""
    HYPERPARAMETERS = common.hyperparameters.read("language-model")
    minibatch = []
    for e in get_train_example():
        minibatch.append(e)
        if len(minibatch) >= HYPERPARAMETERS["MINIBATCH SIZE"]:
            assert len(minibatch) == HYPERPARAMETERS["MINIBATCH SIZE"]
            yield minibatch
            minibatch = []
</pre>
<h2>You can&#8217;t persist training state by pickling your generators</h2>
<p>However, generators become problematic when you want to persist your experiment&#8217;s state in order to later restart training at the same place. Unfortunately, <a href="http://bugs.python.org/issue1092962">you can&#8217;t pickle generators in Python</a>. And it can be a bit of a <a href="http://en.wiktionary.org/wiki/pain_in_the_ass">PITA</a> to work around this, in order to save the training state.</p>
<h2>Pattern to work around this annoyance</h2>
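<p>A minimal sketch of the failure, assuming Python 3 and only the standard library. <code>count_up</code> is a made-up generator used purely for illustration:</p>

```python
import pickle

def count_up(n):
    """Toy generator standing in for a stream of training examples."""
    for i in range(n):
        yield i

gen = count_up(3)
next(gen)  # advance the generator so it has some state worth saving

try:
    pickle.dumps(gen)
    pickled_ok = True
except TypeError as err:
    # Generator objects do not support pickling, so pickle raises TypeError.
    pickled_ok = False
    print("pickling failed:", err)
```

<p>The position inside the loop lives in the suspended generator itself, which is exactly the part that cannot be saved.</p>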
<p>Following useful discussion on <a href="http://groups.google.com/group/pylearn-dev/browse_thread/thread/c4e4dd3496bbbf08">pylearn-dev</a> and stackoverflow <a href="http://stackoverflow.com/questions/1942328/add-a-member-variable-method-to-a-python-generator">[1]</a> <a href="http://stackoverflow.com/questions/1939015/singleton-python-generator-or-pickle-a-python-generator">[2]</a>, I propose the following pattern for converting generators to pickle-able class objects:</p>
<ol>
<li>Convert the generator to a class in which the generator code is the <a href="http://stackoverflow.com/questions/1942328/add-a-member-variable-method-to-a-python-generator/1942387#1942387">__iter__</a> method</li>
<li>Add <a href="http://docs.python.org/library/pickle.html#object.__getstate__">__getstate__</a> and <a href="http://docs.python.org/library/pickle.html#object.__setstate__">__setstate__</a> methods to the class, to handle pickling. Remember that you can&#8217;t pickle file objects, so any files have to be re-opened after unpickling, as necessary.</li>
</ol>
<p>Here is the updated code, after applying this pattern:</p>
<pre>
import sys
import common.hyperparameters
# myopen is the author's file-opening helper; it is assumed to be in scope.

class TrainingExampleStream(object):
    def __init__(self):
        # Set the state variables, in case pickling happens before __iter__ is called.
        self.filename = None
        self.count = 0    # number of examples yielded so far in this pass

    def __iter__(self):
        HYPERPARAMETERS = common.hyperparameters.read("language-model")
        from vocabulary import wordmap
        self.filename = HYPERPARAMETERS["TRAIN_SENTENCES"]
        # After unpickling, self.count holds the number of examples that were
        # already consumed; skip that many so training resumes at the same place.
        skip = self.count
        seen = 0
        for l in myopen(self.filename):
            prevwords = []
            for w in l.split():
                if wordmap.exists(w):
                    prevwords.append(wordmap.id(w))
                    if len(prevwords) >= HYPERPARAMETERS["WINDOW_SIZE"]:
                        seen += 1
                        if seen > skip:
                            self.count = seen
                            yield prevwords[-HYPERPARAMETERS["WINDOW_SIZE"]:]
                else:
                    prevwords = []
        # The file is exhausted, so the next pass starts from the beginning.
        self.count = 0

    def __getstate__(self):
        return self.filename, self.count

    def __setstate__(self, state):
        """
        @warning: We ignore the pickled filename and use
        HYPERPARAMETERS["TRAIN_SENTENCES"] instead.  If we wanted
        to be really fastidious, we would assume that
        HYPERPARAMETERS["TRAIN_SENTENCES"] might change.  The only
        problem is that if we change filesystems, the filename
        might change just because the base file is in a different
        path. So we issue a warning if the filename is different from what is expected.
        """
        filename, count = state
        HYPERPARAMETERS = common.hyperparameters.read("language-model")
        self.filename = HYPERPARAMETERS["TRAIN_SENTENCES"]
        # The file itself is re-opened (and fast-forwarded) by __iter__.
        self.count = count
        if filename is not None and self.filename != filename:
            sys.stderr.write("self.filename %s != filename given to __setstate__ %s\n" % (self.filename, filename))

class TrainingMinibatchStream(object):
    def __init__(self):
        # Keep one example stream for the lifetime of this object, so that a
        # position restored by __setstate__ is not thrown away by __iter__.
        self.get_train_example = TrainingExampleStream()

    def __iter__(self):
        HYPERPARAMETERS = common.hyperparameters.read("language-model")
        minibatch = []
        for e in self.get_train_example:
            minibatch.append(e)
            if len(minibatch) >= HYPERPARAMETERS["MINIBATCH SIZE"]:
                assert len(minibatch) == HYPERPARAMETERS["MINIBATCH SIZE"]
                yield minibatch
                minibatch = []

    def __getstate__(self):
        return (self.get_train_example.__getstate__(),)

    def __setstate__(self, state):
        """
        @warning: We ignore the filename.
        """
        self.get_train_example = TrainingExampleStream()
        self.get_train_example.__setstate__(state[0])
</pre>
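<p>For readers who want to try the pattern without my library, here is a self-contained sketch, assuming Python 3. <code>WindowStream</code>, its in-memory list of sentences, and the skip-on-resume strategy are all illustrative assumptions rather than the code I actually run. The key point is that the object stores only plain data (a counter), while the unpicklable generator is created fresh by <code>__iter__</code>:</p>

```python
class WindowStream(object):
    """Yield sliding windows of words, remembering how many were yielded."""

    def __init__(self, sentences, window_size):
        self.sentences = sentences      # hypothetical in-memory corpus
        self.window_size = window_size
        self.count = 0                  # windows yielded so far in this pass

    def __iter__(self):
        # After unpickling, self.count says how many windows to skip.
        skip = self.count
        seen = 0
        for sentence in self.sentences:
            words = sentence.split()
            for i in range(len(words) - self.window_size + 1):
                seen += 1
                if seen > skip:
                    self.count = seen
                    yield words[i:i + self.window_size]
        # Exhausted: the next pass starts from the beginning.
        self.count = 0

    def __getstate__(self):
        # Only plain data is saved; no generator or file object is stored.
        return {"sentences": self.sentences,
                "window_size": self.window_size,
                "count": self.count}

    def __setstate__(self, state):
        self.__dict__.update(state)
```

<p>With this in place, <code>pickle.loads(pickle.dumps(stream))</code> should give back a stream whose next iteration picks up after the last window that was yielded. The price is that resuming re-reads and discards everything before that point.</p>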
]]></content:encoded>
			<wfw:commentRss>http://blog.metaoptimize.com/2009/12/22/why-cant-you-pickle-generators-in-python-workaround-pattern-for-saving-training-state/feed/</wfw:commentRss>
		<slash:comments>3</slash:comments>
		</item>
		<item>
		<title>Use flag --xml when you run mysqldump</title>
		<link>http://blog.metaoptimize.com/2009/10/14/use-flag-xml-when-you-run-mysqldump/</link>
		<comments>http://blog.metaoptimize.com/2009/10/14/use-flag-xml-when-you-run-mysqldump/#comments</comments>
		<pubDate>Wed, 14 Oct 2009 22:40:17 +0000</pubDate>
		<dc:creator>Joseph Turian</dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[data]]></category>
		<category><![CDATA[JSON]]></category>
		<category><![CDATA[mysql]]></category>
		<category><![CDATA[mysqldump]]></category>
		<category><![CDATA[xml]]></category>

		<guid isPermaLink="false">http://blog.metaoptimize.com/?p=60</guid>
		<description><![CDATA[Summary:

If you have text data (like a web scrape) stored in a MySQL database, and you want to share the data, mysqldump to XML using the --xml flag.

When fields are unlikely to contain tabs, an even simpler format is a tab-separated file, created using the --tab=path flag to mysqldump. path must be owned by the [...]]]></description>
			<content:encoded><![CDATA[<h1>Summary:</h1>
<p><a href="http://flickr.com/photos/24030756@N05/2649856228" title="psychogenic womb memory-gemini project"><img src="http://farm4.static.flickr.com/3223/2649856228_61b5405cfa_t.jpg" align="right"></a></p>
<p>If you have text data (like a <a class="zem_slink" href="http://en.wikipedia.org/wiki/Screen_scraping" title="Screen scraping" rel="wikipedia">web scrape</a>) stored in a <a class="zem_slink" href="http://www.mysql.com" title="MySQL" rel="homepage">MySQL</a> database, and you want to share the data, mysqldump to <a class="zem_slink" href="http://en.wikipedia.org/wiki/XML" title="XML" rel="wikipedia">XML</a> using the <tt>--xml</tt> flag.</p>
<p>When fields are unlikely to contain tabs, an even simpler format is a tab-separated file, created using the <tt>--tab=path</tt> flag to mysqldump. <tt>path</tt> must be owned by the MySQL database user.
</p>
<h1>The Problem with the standard MySQL dump format</h1>
<p>The standard MySQL dump looks as follows:</p>
<pre><code>INSERT INTO `sources` VALUES (1,'2009-03-07 22:06:36','"You\'ve got to be kidding me"', ...
</code></pre>
<p>The problem is that the standard dump format is difficult to interact with programmatically.</p>
<p>It is difficult to parse using <a class="zem_slink" href="http://en.wikipedia.org/wiki/Regular_expression" title="Regular expression" rel="wikipedia">regular expressions</a> because you cannot merely search for single quotes. You have to search for single quotes that are not preceded by a <a href="http://en.wikipedia.org/wiki/Backslash">backslash</a> (unless, perhaps, that backslash is preceded by a backslash).</p>
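<p>To make the quoting problem concrete, here is a small sketch using Python 3 and the standard <code>re</code> module. The sample line is adapted from the excerpt above, and the second pattern is only an illustration of backslash-aware matching, not a full parser for the dump format:</p>

```python
import re

# One VALUES tuple adapted from the excerpt above; the raw string keeps the backslash.
line = r"""(1,'2009-03-07 22:06:36','"You\'ve got to be kidding me"')"""

# Naive: a quoted string is everything between two single quotes.
naive = re.findall(r"'([^']*)'", line)

# Backslash-aware: inside the quotes, accept either an ordinary character
# or a backslash followed by any one character.
careful = re.findall(r"'((?:[^'\\]|\\.)*)'", line)

print(naive)
print(careful)
```

<p>The naive pattern cuts the third field off at the escaped quote, while the second pattern returns it whole (still with its backslash, which would need unescaping afterwards).</p>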
<p>Also, I could not find any libraries for reading the standard dump format, nor scripts for converting it into a standard format like <a class="zem_slink" href="http://en.wikipedia.org/wiki/JSON" title="JSON" rel="wikipedia">JSON</a> or XML. I asked <a href="http://www.google.com/search?q=mysql+dump+library&amp;hl=en">the oracle</a> as well as <a href="http://stackoverflow.com/questions/1568838/library-to-read-a-mysql-dump">stackoverflow</a>.</p>
<p>So if you receive a MySQL dump in the standard format, you might have to install MySQL and import the dump to get at your data.</p>
<h1>The tabbed MySQL dump format</h1>
<p>You can create a directory with one file per table, and the table will be one-row-per-line, with <a class="zem_slink" href="http://en.wikipedia.org/wiki/Delimiter-separated_values" title="Delimiter-separated values" rel="wikipedia">tab-separated values</a>:</p>
<pre><code>mysqldump --tab=path database</code></pre>
<p>Here is some example output:</p>
<pre><code>1	2009-03-07 22:06:36	"You've got to be kidding me"</code></pre>
<p>If you get an error of the following form when you issue the mysqldump command:</p>
<pre><code>mysqldump: Got error: 1: Can't create/write to file 'path/database.txt' (Errcode: 13) when executing 'SELECT INTO OUTFILE'</code></pre>
<p>You can resolve this complaint by making sure that /tmp/path is owned by the mysql user (and also writeable by the current Unix user). Thanks <a href="http://forums.mysql.com/read.php?35,172714,172766#msg-172766">JinRong Ye</a>!</p>
<p>This format is convenient if none of your data contains tabs. In <a class="zem_slink" href="http://en.wikipedia.org/wiki/Natural_language_processing" title="Natural language processing" rel="wikipedia">NLP</a>, however, it is quite possible that your text will contain tabs.</p>
<h1>The XML MySQL dump format</h1>
<p>Enter the XML MySQL dump format:</p>
<pre><code>        &lt;table_data name="sources"&gt;
        &lt;row&gt;
                &lt;field name="id"&gt;1&lt;/field&gt;
                &lt;field name="created_at"&gt;2009-03-07 22:06:36&lt;/field&gt;
                &lt;field name="text"&gt;&amp;quot;You've got to be kidding me&amp;quot;&lt;/field&gt;
</code></pre>
<p>Ah&#8230; pure bliss. You can get the XML dump format as follows:</p>
<pre><code>mysqldump --xml database</code></pre>
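<p>Once you have the XML dump, reading it takes only a few lines. The following is a minimal sketch, assuming Python 3 and the standard <code>xml.etree.ElementTree</code> module. The helper name <code>iter_rows</code> is made up, and the code relies only on the <code>table_data</code>, <code>row</code> and <code>field</code> elements shown in the excerpt above:</p>

```python
import xml.etree.ElementTree as ET

def iter_rows(source, table_name):
    """Yield one dict per row of the table_data element named table_name.

    source is a filename or a binary file object holding mysqldump --xml output.
    """
    current_table = None
    for event, elem in ET.iterparse(source, events=("start", "end")):
        if elem.tag == "table_data":
            # Track which table the parser is currently inside.
            current_table = elem.get("name") if event == "start" else None
        elif event == "end" and elem.tag == "row":
            if current_table == table_name:
                # Character entities are already decoded in field.text.
                yield dict((f.get("name"), f.text) for f in elem.findall("field"))
            elem.clear()  # release the children of rows already handled
```

<p>For example, after <code>mysqldump --xml database &gt; dump.xml</code> (the file name is arbitrary), <code>for row in iter_rows("dump.xml", "sources")</code> would yield dictionaries keyed by column name.</p>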
]]></content:encoded>
			<wfw:commentRss>http://blog.metaoptimize.com/2009/10/14/use-flag-xml-when-you-run-mysqldump/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>Automatically sorting graph curves</title>
		<link>http://blog.metaoptimize.com/2009/09/17/automatically-sorting-graph-curves/</link>
		<comments>http://blog.metaoptimize.com/2009/09/17/automatically-sorting-graph-curves/#comments</comments>
		<pubDate>Thu, 17 Sep 2009 22:16:09 +0000</pubDate>
		<dc:creator>Joseph Turian</dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[gnuplot]]></category>
		<category><![CDATA[Heuristics]]></category>

		<guid isPermaLink="false">http://blog.metaoptimize.com/?p=29</guid>
		<description><![CDATA[A script for automatically sorting graph curves, e.g. for gnuplot.]]></description>
			<content:encoded><![CDATA[<h1>Summary</h1>
<p>A script for automatically sorting graph curves, e.g. for <a class="zem_slink" href="http://www.gnuplot.info/" title="Gnuplot" rel="homepage">gnuplot</a>.</p>
<h1>Problem</h1>
<p>When you have a bunch of curves, and you plot them in an arbitrary order, you might get the following:</p>
<p><img src="http://blog.metaoptimize.com/wp-content/uploads/2009/09/example1-unsorted.png"></p>
<p>Typically, you want to sort the graphs in what appears to be visually descending order, as follows:</p>
<p><img src="http://blog.metaoptimize.com/wp-content/uploads/2009/09/example1-sorted.png"></p>
<p>Sorting the curves is usually done manually, by eyeballing the curves. However, manual sorting of graph curves can become tedious. And when some curves don&#8217;t go out as far on the x-axis, it can be even trickier to place these short curves. (A curve might be short if the corresponding experimental run trains more slowly.)</p>
<h1>Heuristic approach</h1>
<p>An automatic heuristic sorting approach is as follows:</p>
<ul>
<li>We maintain a sorted list of curves, from highest to lowest. The sorted list is initialized to empty.
</li>
<li>At each iteration, we find the curve that goes the furthest out on the x-axis, but is not yet in the sorted list. We then choose where to insert it into the sorted list.
<ul>
<li>For this curve and all curves in the sorted list, we want an estimate of the curve value at the current curve&#8217;s furthest x-value. We compute this estimate using a <a class="zem_slink" href="http://en.wikipedia.org/wiki/Moving_average" title="Moving average" rel="wikipedia">moving average</a>. (For this reason, all curves should have aligned x-axis steps, and should have equidistant x-axis steps.)</li>
<li>We place this curve into the sorted list, to minimize the number of rank errors of curve estimates at this x-value.</li>
</ul>
</li>
</ul>
<p>And that&#8217;s it!</p>
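<p>The steps above can be sketched in a few lines of Python 3. This is an illustrative reimplementation, not the script linked below: it assumes each curve is a plain list of y-values at aligned, equidistant x steps, uses a trailing moving average, and the names <code>sort_curves</code> and <code>window</code> are made up for the example:</p>

```python
def moving_average(ys, end, window):
    """Mean of the last `window` values of ys, up to and including index `end`."""
    start = max(0, end - window + 1)
    chunk = ys[start:end + 1]
    return sum(chunk) / len(chunk)

def sort_curves(curves, window=3):
    """Return curve names ordered from visually highest to lowest.

    curves maps a name to its list of y-values.
    """
    ordered = []
    # Handle the curves that reach furthest along the x-axis first.
    for name in sorted(curves, key=lambda n: len(curves[n]), reverse=True):
        end = len(curves[name]) - 1   # this curve's furthest x step
        mine = moving_average(curves[name], end, window)
        others = [moving_average(curves[o], end, window) for o in ordered]

        def rank_errors(pos):
            # Curves ranked above the slot but estimated lower, plus
            # curves ranked below the slot but estimated higher.
            above_but_lower = sum(1 for v in others[:pos] if mine > v)
            below_but_higher = sum(1 for v in others[pos:] if v > mine)
            return above_but_lower + below_but_higher

        best = min(range(len(ordered) + 1), key=rank_errors)
        ordered.insert(best, name)
    return ordered
```

<p>For instance, with a long curve near 5, a long curve near 1 and a short curve near 3, the short curve is slotted between the two long ones.</p>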
<h1>Example output</h1>
<p>Here is the sorted output of a larger, more difficult example, sorted using the above heuristic. Click on this image to get a larger version you can inspect:<br />
<a href="http://blog.metaoptimize.com/wp-content/uploads/2009/09/example2-sorted.png"><img src="http://blog.metaoptimize.com/wp-content/uploads/2009/09/example2-sorted-small.png"></a><br />
A few of the decisions aren&#8217;t good. For example, why is curve 15 placed above curve 6? But most of the decisions are reasonable. For example, curve 13 is placed at the bottom, because it is very low compared to the other curves for the short duration that curve 13 is present.</p>
<h1>Code</h1>
<p>I have written a script implementing the heuristic above.</p>
<p>Here is the latest version of <a href="http://github.com/turian/common-scripts/blob/master/sort-curves.py">sort-curves.py</a>.<br />
You will also need <a href="http://github.com/turian/common/blob/master/movingaverage.py">movingaverage.py</a> from my <a class="zem_slink" href="http://www.python.org/" title="Python (programming language)" rel="homepage">Python</a> common library.</p>
<p>USAGE:</p>
<pre><code>./sort-curves.py *.dat
</code></pre>
<p>where every *.dat is in standard (gnuplot) two-column-per-line format:</p>
<pre><code>xvalue yvalue
</code></pre>
<p>Overall, I find this script a useful timesaver.</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.metaoptimize.com/2009/09/17/automatically-sorting-graph-curves/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Fast deserialization in Python</title>
		<link>http://blog.metaoptimize.com/2009/03/22/fast-deserialization-in-python/</link>
		<comments>http://blog.metaoptimize.com/2009/03/22/fast-deserialization-in-python/#comments</comments>
		<pubDate>Mon, 23 Mar 2009 02:48:31 +0000</pubDate>
		<dc:creator>Joseph Turian</dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[JSON]]></category>
		<category><![CDATA[Python]]></category>
		<category><![CDATA[YAML]]></category>

		<guid isPermaLink="false">http://blog.metaoptimize.com/?p=5</guid>
		<description><![CDATA[All standard YMMV disclaimers apply.
Update (20090324-2): According to John Millikin, the author of jsonlib, cjson is buggy and unmaintained. I will evaluate further and post a followup blog entry. My discussion with Dan Pascu, the author of cjson, corroborates these claims. I urge readers to read John Millikin&#8217;s comment.
Summary:
For quickly deserializing data in Python, use [...]]]></description>
			<content:encoded><![CDATA[<p><em>All standard YMMV disclaimers apply</em>.</p>
<p><b>Update (20090324-2):</b> According to <a href="http://news.ycombinator.com/item?id=529104">John Millikin</a>, the author of jsonlib, cjson is buggy and unmaintained. I will evaluate further and post a followup blog entry. My discussion with Dan Pascu, the author of cjson, corroborates these claims. I urge readers to read John Millikin&#8217;s comment.</p>
<h1>Summary:</h1>
<p><del>For quickly deserializing data in Python, use <a href="http://pypi.python.org/pypi/python-cjson">cjson</a>.</del><br />
simplejson is mysteriously slow on certain installations.</p>
<p><b>Update (20090324):</b> According to <a href="http://kbyanc.blogspot.com/2007/07/python-serializer-benchmarks.html" rel="nofollow">Extra Cheese</a>, cjson 1.0.5 has an incompatibility with simplejson in processing slashes. A fix is available from <a href="http://www.vazor.com/cjson.html" rel="nofollow">Matt Billenstein</a>. However, Dan Pascu, the author of cjson, deprecates Matt Billenstein&#8217;s cjson 1.0.6 because Matt&#8217;s patch parses the JSON twice, which makes it twice as slow. This will still be faster than all alternatives in certain circumstances. You will not find Matt&#8217;s cjson on the cheeseshop, only on Matt&#8217;s site.
</p>
<h1>Abstract:</h1>
<p>We were initially using simplejson for our work, because the <a href="http://json.org/">JSON</a> format is human-readable and because anecdotal evidence from the blogosphere touted simplejson&#8217;s new C speedups.  We observed that simplejson was actually quite slow on one of our installation environments. This observation prompted us to do this study.  We found that cjson consistently achieves the fastest deserialization performance.  We still do not understand why simplejson is slow in certain installation environments.</p>
<h1>Approach:</h1>
<p>We compared the following serialization approaches:</p>
<ul>
<li><a href="http://pypi.python.org/pypi/simplejson">simplejson</a> 2.0.9, with C speedups</li>
<li><a href="http://pypi.python.org/pypi/jsonlib/">jsonlib</a> 1.3.10</li>
<li><a href="http://pypi.python.org/pypi/python-cjson">cjson</a> 1.0.5</li>
<li><a href="http://pyyaml.org/wiki/PyYAML">PyYAML</a> 3.05 with <a href="http://pyyaml.org/wiki/LibYAML">libyaml</a> 0.1.1/0.1.2 C bindings. (We used 0.1.1 on dormeur and 0.1.2 on mammouth.)</li>
<li><a href="http://pyyaml.org/wiki/PySyck">PySyck</a> 0.61.2 with <a href="http://whytheluckystiff.net/syck/">syck</a> 0.55 C bindings. Note that PySyck did not compile until we followed the advice in <a href="http://pyyaml.org/ticket/67">this ticket</a>.</li>
<li>Google <a href="http://code.google.com/p/protobuf/">protobuf</a> 2.0.3</li>
<li>Python <a href="http://docs.python.org/library/pickle.html">pickle</a>, protocol=-1 (binary)</li>
<li>Python pickle, protocol=0 (text)</li>
</ul>
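<p>To make the two pickle settings concrete, here is a minimal sketch that serializes a small record with protocol=-1 (the highest binary protocol the interpreter supports) and protocol=0 (text), and checks that both round-trip. The record and variable names are hypothetical, and the sketch uses the standard pickle module, not our benchmark code.</p>
<pre><code>import pickle

# Hypothetical record, loosely modeled on a vocabulary term.
record = {"canonical form": "the proposed deletion", "rank": 3590, "count": 7180.0}

# protocol=-1 selects the highest (binary) protocol the interpreter supports.
binary_blob = pickle.dumps(record, -1)
# protocol=0 is the original text protocol.
text_blob = pickle.dumps(record, 0)

# Both encodings deserialize back to an equal object.
assert pickle.loads(binary_blob) == record
assert pickle.loads(text_blob) == record

# Report the size in bytes of each encoding.
print("binary: %d bytes, text: %d bytes" % (len(binary_blob), len(text_blob)))
</code></pre>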
<p>We have not tried the following serialization approaches:</p>
<ul>
<li>Python <a href="http://docs.python.org/library/marshal.html">marshal</a>, which is supposedly much faster than Python pickle. On the downside, the marshal format may change between Python versions.</li>
<li>Native Python, i.e. reading the repr() of the data as a module</li>
<li>XML implementations</li>
<li>Facebook <a href="http://incubator.apache.org/thrift/">thrift</a></li>
<li>Hand-coding C serialization</li>
</ul>
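<p>For reference, a marshal round trip looks roughly like the following. This is only a sketch with a hypothetical record. We did not benchmark it, so it says nothing about speed.</p>
<pre><code>import marshal

# Hypothetical record; marshal handles dicts, lists, strings and numbers.
record = {"form": "the proposed deletion", "count": 7153.333333333333}

blob = marshal.dumps(record)
restored = marshal.loads(blob)

# marshal.version identifies the format, which is not guaranteed
# to stay the same across Python releases.
print("marshal format version %d, round trip ok: %s" % (marshal.version, restored == record))
</code></pre>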
<h1>Experiments:</h1>
<h2>Data:</h2>
<p>We were working with a data structure we call the &#8220;vocabulary&#8221;. The vocabulary is a list of vocabulary terms. Each vocabulary term in turn contains a list of term forms. An example vocabulary term is as follows:</p>
<pre><code>{
    "term class": "the propos delet",
    "canonical form": "the proposed deletion",
    "rank": 3590,
    "count": 7180.0,
    "term forms": [
        { "form": "the proposed deletion", "count": 7153.333333333333 },
        { "form": "the proposed deletions", "count": 13.666666666666666 },
        { "form": "The proposed deletion", "count": 12.0 },
        { "form": "the proposed deletes", "count": 1.0 }
    ]
}
</code></pre>
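<p>As an illustration of how such a term maps onto Python objects, the sketch below builds the example term as a dict, round-trips it through JSON, and walks its term forms. It uses the standard library json module (available from Python 2.6) as a stand-in for the libraries we benchmarked. In this particular example the form counts happen to add up to the count of the term.</p>
<pre><code>import json

# The example term from above, written as a Python dict.
term = {
    "term class": "the propos delet",
    "canonical form": "the proposed deletion",
    "rank": 3590,
    "count": 7180.0,
    "term forms": [
        {"form": "the proposed deletion", "count": 7153.333333333333},
        {"form": "the proposed deletions", "count": 13.666666666666666},
        {"form": "The proposed deletion", "count": 12.0},
        {"form": "the proposed deletes", "count": 1.0},
    ],
}

# Serialize to a JSON string, then deserialize it again.
decoded = json.loads(json.dumps(term))

# Sum the counts of the individual forms (rounded to hide float noise).
form_total = round(sum(form["count"] for form in decoded["term forms"]), 6)
print("%d forms, form counts sum to %s" % (len(decoded["term forms"]), form_total))
</code></pre>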
<p>We performed all our deserialization experiments on a vocabulary file that contained 502K fields, as counted using:</p>
<pre><code>zcat vocabulary.json.gz | grep ':' | wc -l
</code></pre>
<p>We use gzip on all serialized files, both when writing them and when reading them.  The size of the vocabulary in different serialization formats was as follows:</p>
<p><center></p>
<table>
<tr>
<td><b>Format</b></td>
<td><b>gzip&#8217;ed size</b></td>
</tr>
<tr>
<td>protobuf</td>
<td>1.7 MB</td>
</tr>
<tr>
<td>JSON</td>
<td>1.9 MB</td>
</tr>
<tr>
<td>pickle (protocol -1)</td>
<td>4.0 MB</td>
</tr>
<tr>
<td>pickle (protocol 0)</td>
<td>4.3 MB</td>
</tr>
</table>
<p></center></p>
<p>gzip&#8217;ed JSON uses only about 10% more disk space than gzip&#8217;ed protobuf, which is the most compact serialization format we tested.  JSON has the advantage of being human-readable, unlike Protocol Buffers.</p>
<h2>Setup:</h2>
<p>We tested on two different eight-core x86-64 Linux installation environments.</p>
<p><center></p>
<table>
<tr>
<td><b>Name</b></td>
<td><b>Python version</b></td>
<td><b>CPU model name</b></td>
<td><b>Kernel version</b></td>
</tr>
<tr>
<td>dormeur</td>
<td>2.5</td>
<td>Intel(R) Core(TM)2 Duo CPU     E8400  @ 3.00GHz</td>
<td>2.6.23.17-88.fc7</td>
</tr>
<tr>
<td>mammouth</td>
<td>2.6.1</td>
<td>Intel(R) Xeon(R) CPU           E5462  @ 2.80GHz</td>
<td>2.6.18-92.1.10.el5_lustre.1.6.6smp</td>
</tr>
</table>
<p></center></p>
<h2>Results:</h2>
<p>We read in the vocabulary using a particular deserialization approach.  We measure real time, as well as the combined user time and system time, using the Unix &#8216;time&#8217; command.  For each experiment, we ran the deserialization of the vocabulary three times, and averaged the times over these three runs. Variance appeared to be low, but we did not compute it.  We present all times in seconds.  Some experiments were not performed on mammouth.</p>
<p>The first result line in the table, &#8216;read&#8217;, is when we read the vocabulary json.gz file into memory, but do not deserialize it. It provides an upper bound on the performance of the deserializer, i.e. a lower bound on the time any deserializer can take.</p>
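<p>The measurement procedure can be sketched as follows. This is not our benchmark script: it times inside Python with time.time() instead of the Unix &#8216;time&#8217; command, it uses the standard library json module as the deserializer, and the vocabulary, file name and function names are hypothetical stand-ins.</p>
<pre><code>import gzip
import json
import os
import tempfile
import time

def average_seconds(func, runs=3):
    """Call func() `runs` times and return the mean wall-clock time."""
    elapsed = []
    for _ in range(runs):
        start = time.time()
        func()
        elapsed.append(time.time() - start)
    return sum(elapsed) / len(elapsed)

# Hypothetical stand-in for the vocabulary: many small records.
vocabulary = [{"form": "term %d" % i, "count": float(i)} for i in range(10000)]

# Write the data as a gzip'ed JSON file in a temporary directory.
path = os.path.join(tempfile.mkdtemp(), "vocabulary.json.gz")
with gzip.open(path, "wb") as f:
    f.write(json.dumps(vocabulary).encode("utf-8"))

def read_only():
    # Baseline: decompress into memory without deserializing.
    with gzip.open(path, "rb") as f:
        return f.read()

def read_and_deserialize():
    # Decompress, then parse the JSON text into Python objects.
    return json.loads(read_only().decode("utf-8"))

read_time = average_seconds(read_only)
load_time = average_seconds(read_and_deserialize)
print("read %.4fs, read+deserialize %.4fs" % (read_time, load_time))
</code></pre>
<p>Swapping a different library into read_and_deserialize is the only change needed to compare deserializers on the same file.</p>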
<p>The following table presents the results, sorted by real time on dormeur.</p>
<p><center></p>
<table cellpadding="2" border="1">
<tr>
<td><b>deserializer</b></td>
<td colspan="2" align="center"><b>dormeur</b></td>
<td colspan="2" align="center"><b>mammouth</b></td>
</tr>
<tr>
<td></td>
<td>real</td>
<td>user+sys</td>
<td>real</td>
<td>user+sys</td>
</tr>
<tr>
<td>read</td>
<td>0.76</td>
<td>0.24</td>
<td>0.18</td>
<td>0.18</td>
</tr>
<tr>
<td>cjson</td>
<td>2.17</td>
<td>1.04</td>
<td>0.93</td>
<td>0.91</td>
</tr>
<tr>
<td>jsonlib</td>
<td>7.88</td>
<td>6.59</td>
<td>3.77</td>
<td>3.77</td>
</tr>
<tr>
<td>cPickle (protocol -1)</td>
<td>13.3</td>
<td>9.9</td>
<td>10.2</td>
<td>10.2</td>
</tr>
<tr>
<td>PySyck</td>
<td>19.1</td>
<td>18.2</td>
<td></td>
<td></td>
</tr>
<tr>
<td>simplejson</td>
<td>24.7</td>
<td>16.2</td>
<td>1.10</td>
<td>1.04</td>
</tr>
<tr>
<td>cPickle (protocol 0)</td>
<td>25.1</td>
<td>20.4</td>
<td>20.7</td>
<td>20.7</td>
</tr>
<tr>
<td>protobuf</td>
<td>42.3</td>
<td>32.4</td>
<td></td>
<td></td>
</tr>
<tr>
<td>PyYAML</td>
<td>89.3</td>
<td>80.5</td>
<td>319</td>
<td>318</td>
</tr>
</table>
<p></center></p>
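<p>Observe that simplejson is more than an order of magnitude slower on dormeur than on mammouth (24.7 vs. 1.10 seconds of real time). On dormeur it is also more than an order of magnitude slower than cjson (24.7 vs. 2.17 seconds), whereas on mammouth the two are nearly equal.</p>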
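<p>The size of the gap can be worked out directly from the real times in the results table. The short calculation below only copies those numbers; the variable names are ours.</p>
<pre><code># Real times in seconds, copied from the results table.
real_seconds = {
    "cjson": {"dormeur": 2.17, "mammouth": 0.93},
    "simplejson": {"dormeur": 24.7, "mammouth": 1.10},
}

# How much longer simplejson takes than cjson on each machine.
ratios = {}
for host in ("dormeur", "mammouth"):
    ratios[host] = real_seconds["simplejson"][host] / real_seconds["cjson"][host]
    print("%s: simplejson takes %.1fx as long as cjson" % (host, ratios[host]))

# How much longer simplejson takes on dormeur than on mammouth.
cross_host = real_seconds["simplejson"]["dormeur"] / real_seconds["simplejson"]["mammouth"]
print("simplejson, dormeur vs. mammouth: %.1fx" % cross_host)
</code></pre>
<p>This gives roughly 11.4x on dormeur, 1.2x on mammouth, and 22.5x between the two machines.</p>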
<h1>Conclusions:</h1>
<p>gzip&#8217;ed JSON uses only about 10% more disk space than the most compact serialization format we tested (gzip&#8217;ed Protocol Buffers).  JSON has the advantage of being human-readable, unlike Protocol Buffers.</p>
<p>cjson has the fastest deserialization time of all packages we tested.  We have not measured serialization time in the experiments above, but we do so in the next section.</p>
<p>We did not realize that simplejson was far slower on one of our installs until we did speed tests.  simplejson should be avoided unless you specifically determine that it is comparable in speed to cjson.  On certain installs, simplejson deserialization is as fast as cjson.  On other installs, simplejson deserialization is an order of magnitude slower than cjson.  On &#8220;slow&#8221; installs, the user is led to believe that C speedups have been compiled into simplejson. Indeed, evidence indicates that our &#8220;slow&#8221; simplejson installation was, nonetheless, using C speedups:</p>
<pre><code>&gt;&gt;&gt; import simplejson
&gt;&gt;&gt; simplejson.decoder.make_scanner
&lt;type 'simplejson._speedups.Scanner'&gt;
&gt;&gt;&gt; simplejson.decoder.scanstring is simplejson.decoder.c_scanstring
True
</code></pre>
<p>The user might detect that simplejson is slow only by making a direct speed comparison to cjson.</p>
<p>protobuf is interesting because it requires one to declare the protocol schema. This is useful for documenting your data format. Unfortunately, the Python implementation of Google&#8217;s Protocol Buffers is very slow because it is <a href="http://news.ycombinator.com/item?id=498982">pure Python</a>.</p>
<p>Generating C++ Protocol Buffers and wrapping them with swig, as suggested by this <a href="http://news.ycombinator.com/item?id=499040">commentator</a>, might be faster than cjson.  Hand-coding C serialization routines is another option if one must eke out every last bit of speed.</p>
<h1>Related work:</h1>
<p><a href="http://bouncybouncy.net/ramblings/posts/json_vs_thrift_and_protocol_buffers_round_2/">This study</a> and <a href="http://gist.github.com/72412">this followup</a> provide supporting evidence that cjson is faster than alternatives. Neither of these studies experienced any simplejson slowness.</p>
<p>We used bouncybouncy&#8217;s <a href="http://bouncybouncy.net/ramblings/files/sertest2.tgz">sertest2 code</a>, and modified it to use CDumper and CLoader (the C libyaml bindings) in PyYAML.  We also modified the code to create 100K records.</p>
<p>Here is the output of sertest2 running on dormeur, which we have modified slightly for improved readability:</p>
<pre><code>100000 total records        (0.830s)

get_thrift                  (0.300s)
get_protobuf                (5.010s)

Serialize:
ser_cjson                   (0.270s) 6807019 bytes
ser_simplejson              (2.210s) 6807019 bytes
ser_yaml                    (31.590s) 6107019 bytes
ser_protobuf                (19.760s) 1716519 bytes

Serialize to a gzip'ed file:
ser_cjson_compressed        (0.520s) 1245257 bytes
ser_simplejson_compressed   (2.440s) 1245257 bytes
ser_protobuf_compressed     (19.920s) 980508 bytes
ser_yaml_compressed         (31.610s) 1205509 bytes

Deserialize:
serde_cjson                 (0.510s)
serde_simplejson            (12.370s)
serde_protobuf              (36.740s)
serde_yaml                  [slow, got tired of waiting for it]
</code></pre>
<p>bouncybouncy&#8217;s related study also compares with <a href="http://incubator.apache.org/thrift/">thrift</a>, which we do not use.  bouncybouncy finds that thrift is faster than protobuf but slower than cjson.  When we installed thrift (SVN revision 757299) on dormeur, sertest2 thrift routines crashed with the following traceback:</p>
<pre><code>Traceback (most recent call last):
  File "./test_speed.py", line 169, in &lt;module&gt;
    print 'serde_thrift        (%0.3fs)' % t(serde_thrift)[0]
  File "./test_speed.py", line 138, in t
    ret = f()
  File "./test_speed.py", line 108, in serde_thrift
    s = _ser_thrift()
  File "./test_speed.py", line 73, in _ser_thrift
    return thrift_to_bytes(ret)
  File "./test_speed.py", line 59, in thrift_to_bytes
    var.write(protocolOut)
  File "gen-py/passivedns/ttypes.py", line 146, in write
    iter6.write(oprot)
AttributeError: 'str' object has no attribute 'write'
</code></pre>
<p>The results presented in this section, as well as the results of the related studies, match the relative performance of these libraries on mammouth in our earlier experiments.</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.metaoptimize.com/2009/03/22/fast-deserialization-in-python/feed/</wfw:commentRss>
		<slash:comments>8</slash:comments>
		</item>
	</channel>
</rss>
