March 10th, 2010

Okay, I’m ready.

After reading a handful of articles making tenuous connections between entrepreneurship and music, I’ve decided to come out and share my favorite startup music.

Dirt, by The Stooges, is a proto-punk cut that sprawls for seven minutes, brooding and smoldering. It never climaxes or burns out; it just persists and drives forward.

Anyway, I believe this song should be the mantra for bootstrappers, in particular those who practice the lean startup methodology.

    Ooh, I been dirt / And I don’t care / Cause I’m burning inside / I’m just a yearning inside / And I’m the fire o’ life.

Without further ado, DIRT:

February 27th, 2010

Summary

I propose a default “Constitution for Governance of Open-Source Projects”.


Background

I recently got involved in the OSQA project, which is a fork of CNPROG, which in turn is a clone of the StackExchange Q&A forum software.

Note that the OSQA project has no formal “homepage”, nor instructions on how to get involved. I only discovered by chance that there is a mailing list (unarchived) and a developer chat room. Nor was it immediately clear which OSQA github fork one should use.

This is because OSQA grew organically from one contributor to a handful, and developer involvement was an afterthought in this project. Not that there is anything wrong with that.
However, now that a handful of people are involved in the project, and more people are trying to get involved, we have begun discussing governance and decision-making policies on the mailing list. In fact,
Evgeny Fadeev poses this very question on StackOverflow, and proposes some potential answers.

I believe that, by default, there are some simple but clear principles that should be enunciated. I hereby propose my

Constitution for Governance of Open-Source Projects (v20100227)

Let it be affirmed that the primary goal in instituting governance of an open-source project be to ensure the long-term health of the project.

Accordingly, the default bias should be towards openness and inclusiveness.
However, policy should be changed as issues present themselves, in order to maintain the long-term health of the project.

For the model of decision making, we favor a “do-ocracy”.
The people who contribute the most generally command the respect of the community.
Alienating them is the best way to derail the project.

The repository should be open to committers, given that commits can easily be reverted and commit access easily revoked. This is preferable to alienating potential committers.

To ensure transparency for developers new and old, and to allow them to decide their involvement in a project based upon the history of the project, there should be transparency and openness in the inner workings of the project. For example, the email archive should be public.

Lastly, let us remember that too much red tape gets in the way of progress. So red tape and other barriers to contribution should be avoided, and only added as issues present themselves.

This Constitution can and should be amended as issues present themselves.

Therefore be it resolved.


Summary

A pattern for persisting generators is to turn them into pickle-able class objects. This is useful when you use generators for streaming training examples.

I would also try generator_tools, which might be a more convenient alternative to the pattern I describe. I haven’t used it yet.


Generators for streaming training examples

For machine learning, Python generators are a simple idiom that makes it easy to generate a stream of training examples. Moreover, you can nest generators:

  • The inner generator can be used to read one example at a time.
  • The outer generator can be used to read examples from the inner generator until you have a full minibatch, and then yield this minibatch.

Here is some example code:

[Update: The example holds without the ALL CAPS magic variable names, "HYPERPARAMETERS". However, I include HYPERPARAMETERS because I am including the actual code I am using. Hyperparameters are global, read-only variables that specify the particular experimental condition being tested. I can't say that I have the best solution to this particular aspect of experimental control (hyperparameters). I might write a blog post about it in the future, to solicit feedback on improved methods. However, I have refined my current approach over several years, and I can assure you that it is far less painful than a handful of more "clean" approaches.]

def get_train_example():
    HYPERPARAMETERS = common.hyperparameters.read("language-model")

    from vocabulary import wordmap
    for l in myopen(HYPERPARAMETERS["TRAIN_SENTENCES"]):
        prevwords = []
        for w in string.split(l):
            w = string.strip(w)
            id = None
            if wordmap.exists(w):
                prevwords.append(wordmap.id(w))
                if len(prevwords) >= HYPERPARAMETERS["WINDOW_SIZE"]:
                    yield prevwords[-HYPERPARAMETERS["WINDOW_SIZE"]:]
            else:
                prevwords = []

def get_train_minibatch():
    HYPERPARAMETERS = common.hyperparameters.read("language-model")
    minibatch = []
    for e in get_train_example():
        minibatch.append(e)
        if len(minibatch) >= HYPERPARAMETERS["MINIBATCH SIZE"]:
            assert len(minibatch) == HYPERPARAMETERS["MINIBATCH SIZE"]
            yield minibatch
            minibatch = []

You can’t persist training state by pickling your generators

However, generators become problematic when you want to persist your experiment’s state in order to later restart training at the same place. Unfortunately, you can’t pickle generators in Python. And it can be a bit of a PITA to work around this, in order to save the training state.

Pattern to work around this annoyance
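The limitation is easy to reproduce. Here is a minimal sketch; the generator count_up and the helper can_pickle are hypothetical names used only for illustration:

```python
import pickle


def count_up(limit):
    """Hypothetical stand-in for a generator that streams training examples."""
    for i in range(limit):
        yield i


def can_pickle(obj):
    """Return True if pickle accepts obj, False if it raises TypeError."""
    try:
        pickle.dumps(obj)
        return True
    except TypeError:
        return False


stream = count_up(5)
next(stream)  # advance the stream, as training would

print(can_pickle([0, 1, 2]))  # a plain list can be pickled
print(can_pickle(stream))     # a live generator cannot
```

The generator holds a suspended execution frame, which is the part pickle refuses to serialize.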

Following useful discussion on pylearn-dev and stackoverflow [1] [2], I propose the following pattern for converting generators to pickle-able class objects:

  1. Convert the generator to a class in which the generator code is the __iter__ method
  2. Add __getstate__ and __setstate__ methods to the class, to handle pickling. Remember that you can’t pickle file objects. So __setstate__ will have to re-open files, as necessary.

Here is the updated code, after applying this pattern:

class TrainingExampleStream(object):
    def __init__(self):
        # Set the state variables, in case pickling happens before __iter__ is called.
        self.filename = None
        self.count = 0
        pass

    def __iter__(self):
        HYPERPARAMETERS = common.hyperparameters.read("language-model")
        from vocabulary import wordmap
        self.filename = HYPERPARAMETERS["TRAIN_SENTENCES"]
        self.count = 0
        for l in myopen(self.filename):
            prevwords = []
            for w in string.split(l):
                w = string.strip(w)
                id = None
                if wordmap.exists(w):
                    prevwords.append(wordmap.id(w))
                    if len(prevwords) >= HYPERPARAMETERS["WINDOW_SIZE"]:
                        self.count += 1
                        yield prevwords[-HYPERPARAMETERS["WINDOW_SIZE"]:]
                else:
                    prevwords = []

    def __getstate__(self):
        return self.filename, self.count

    def __setstate__(self, state):
        """
        @warning: We ignore the filename.  If we wanted
        to be really fastidious, we would assume that
        HYPERPARAMETERS["TRAIN_SENTENCES"] might change.  The only
        problem is that if we change filesystems, the filename
        might change just because the base file is in a different
        path. So we issue a warning if the filename is different from what is expected.
        """
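        filename, count = state
        print >> sys.stderr, ("__setstate__(%s)..." % `state`)
        # Unpickling bypasses __init__, so reset the state variables before
        # replaying the stream up to the saved position.
        self.filename = None
        self.count = 0
        iter = self.__iter__()
        while count != self.count:
            iter.next()
        if self.filename != filename:
            # HYPERPARAMETERS is local to __iter__, so read it again here.
            HYPERPARAMETERS = common.hyperparameters.read("language-model")
            assert self.filename == HYPERPARAMETERS["TRAIN_SENTENCES"]
            print >> sys.stderr, ("self.filename %s != filename given to __setstate__ %s" % (self.filename, filename))
        print >> sys.stderr, ("...__setstate__(%s)" % `state`)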

class TrainingMinibatchStream(object):
    def __init__(self):
        pass

    def __iter__(self):
        HYPERPARAMETERS = common.hyperparameters.read("language-model")
        minibatch = []
        self.get_train_example = TrainingExampleStream()
        for e in self.get_train_example:
            minibatch.append(e)
            if len(minibatch) >= HYPERPARAMETERS["MINIBATCH SIZE"]:
                assert len(minibatch) == HYPERPARAMETERS["MINIBATCH SIZE"]
                yield minibatch
                minibatch = []

    def __getstate__(self):
        return (self.get_train_example.__getstate__(),)

    def __setstate__(self, state):
        """
        @warning: We ignore the filename.
        """
        self.get_train_example = TrainingExampleStream()
        self.get_train_example.__setstate__(state[0])
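
For readers who want to try the pattern without my hyperparameter and vocabulary modules, here is a self-contained Python 3 sketch. It is a simplification and not the code I run: the class name WindowStream is hypothetical, the windows are plain word windows within each line, and the saved state is just the filename, the window size and the number of examples already yielded. On unpickling, the file is re-opened lazily and the already-consumed examples are skipped.

```python
import pickle


class WindowStream:
    """Pickle-able stream of fixed-size word windows read from a text file."""

    def __init__(self, filename, window_size):
        self.filename = filename
        self.window_size = window_size
        self.count = 0      # examples yielded so far; this is the saved position
        self._gen = None    # live generator, created lazily and never pickled

    def _generate(self, skip):
        seen = 0
        with open(self.filename) as f:
            for line in f:
                words = line.split()
                for i in range(len(words) - self.window_size + 1):
                    seen += 1
                    if seen <= skip:
                        continue  # replay: discard examples already consumed
                    self.count = seen
                    yield words[i:i + self.window_size]

    def __iter__(self):
        return self

    def __next__(self):
        if self._gen is None:
            # (Re)open the file and fast-forward to the saved position.
            self._gen = self._generate(skip=self.count)
        return next(self._gen)

    def __getstate__(self):
        # Save only plain data: no file object, no generator.
        return {"filename": self.filename,
                "window_size": self.window_size,
                "count": self.count}

    def __setstate__(self, state):
        self.__dict__.update(state)
        self._gen = None
```

Usage is pickle.loads(pickle.dumps(stream)) at a checkpoint; iterating over the restored object continues with the next unseen example. Replaying from the start of the file is slow for large files, so storing a file offset instead of a count may be worth considering.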
October 14th, 2009

Summary:

If you have text data (like a web scrape) stored in a MySQL database, and you want to share the data, mysqldump to XML using the --xml flag.

When fields are unlikely to contain tabs, an even simpler format is a tab-separated file, created using the --tab=path flag to mysqldump. path must be owned by the MySQL database user.

The Problem with the standard MySQL dump format

The standard MySQL dump looks as follows:

INSERT INTO `sources` VALUES (1,'2009-03-07 22:06:36','"You\'ve got to be kidding me"', ...

The problem is that the standard dump format is difficult to interact with programmatically.

It is difficult to parse using regular expressions because you cannot merely search for single quotes. You have to search for single quotes that are not preceded by a backslash (unless, perhaps, that backslash is preceded by a backslash).
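
To make the escaping issue concrete, here is a sketch of a regular expression that treats a backslash plus the following character as a unit, so that an escaped quote does not end the value. The pattern name QUOTED_VALUE and the sample line are my own illustration. This handles only backslash escaping inside single-quoted values; it is not a parser for the dump format, and it does not decode the escapes.

```python
import re

# A single-quoted value: any run of characters that are neither a quote nor a
# backslash, or a backslash followed by any one character (an escape pair).
QUOTED_VALUE = re.compile(r"'((?:[^'\\]|\\.)*)'")

line = r"""INSERT INTO `sources` VALUES (1,'2009-03-07 22:06:36','"You\'ve got to be kidding me"');"""

values = QUOTED_VALUE.findall(line)
for value in values:
    print(value)
```

Unquoted values such as the integer 1, NULLs, and multi-row INSERT statements still need separate handling, which is why I do not consider this a real solution.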

Also, there are no libraries for reading the standard dump format, nor scripts for converting it into a standard format like JSON or XML. I asked the oracle as well as stackoverflow.

So if you receive a MySQL dump in the standard format, you might have to install MySQL and import the dump to get at your data.

The tabbed MySQL dump format

You can create a directory with one file per table, and the table will be one-row-per-line, with tab-separated values:

mysqldump --tab=path database

Here is some example output:

1	2009-03-07 22:06:36	"You've got to be kidding me"

If you get an error of the following form when you issue the mysqldump command:

mysqldump: Got error: 1: Can't create/write to file 'path/database.txt' (Errcode: 13) when executing 'SELECT INTO OUTFILE'

You can resolve this complaint by making sure that /tmp/path is owned by the mysql user (and also writeable by the current Unix user). Thanks JinRong Ye!

This format is convenient if none of your data contains tabs. In NLP, however, it is quite possible that your text will contain tabs.
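
When that assumption holds, reading the file back is a one-liner per row. A minimal sketch (the function name parse_tab_line is hypothetical), which simply splits on tabs and therefore produces extra columns as soon as a field contains a raw tab:

```python
def parse_tab_line(line):
    """Split one line of a --tab dump into its column values.

    Assumes no field contains a tab character.
    """
    return line.rstrip("\n").split("\t")


sample = '1\t2009-03-07 22:06:36\t"You\'ve got to be kidding me"\n'
row = parse_tab_line(sample)
print(len(row), row)

# A field with an embedded tab is split into two columns by mistake.
broken = parse_tab_line("2\t2009-03-08 10:00:00\tfirst half\tsecond half\n")
print(len(broken))
```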

The XML MySQL dump format

Enter the XML MySQL dump format:

        <table_data name="sources">
        <row>
                <field name="id">1</field>
                <field name="created_at">2009-03-07 22:06:36</field>
                <field name="text">&quot;You've got to be kidding me&quot;</field>

Ah… pure bliss. You can get the XML dump format as follows:

mysqldump --xml database
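
Reading the XML dump back needs nothing beyond a standard XML parser. Here is a sketch using Python's xml.etree.ElementTree. The function name rows_from_xml_dump, the database name "mydb" and the enclosing mysqldump and database elements in the sample are my assumptions about the surrounding structure; the table_data, row and field elements are as shown above.

```python
import xml.etree.ElementTree as ET

# Small sample in the shape of the output above (wrapper elements assumed).
SAMPLE = """<mysqldump>
<database name="mydb">
    <table_data name="sources">
    <row>
        <field name="id">1</field>
        <field name="created_at">2009-03-07 22:06:36</field>
        <field name="text">&quot;You've got to be kidding me&quot;</field>
    </row>
    </table_data>
</database>
</mysqldump>
"""


def rows_from_xml_dump(xml_text, table):
    """Yield each row of the named table as a dict of column name to text."""
    root = ET.fromstring(xml_text)
    for table_data in root.iter("table_data"):
        if table_data.get("name") != table:
            continue
        for row in table_data.iter("row"):
            yield {field.get("name"): field.text for field in row.iter("field")}


rows = list(rows_from_xml_dump(SAMPLE, "sources"))
print(rows[0]["text"])
```

The parser undoes the entity escaping, so the text field comes back with its original double quotes. For dumps too large to hold in memory, ET.iterparse would be the streaming alternative.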
September 17th, 2009

Summary

A script for automatically sorting graph curves, e.g. for gnuplot.

Problem

When you have a bunch of curves, and you plot them in an arbitrary order, you might get the following:

Typically, you want to sort the graphs in what appears to be visually descending order, as follows:

Sorting the curves is usually done manually, by eyeballing the curves. However, manual sorting of graph curves can become tedious. And when some curves don’t go out as far on the x-axis, it can be even trickier to place these short curves. (Some curves might be short if this experimental run trains more slowly.)

Heuristic approach

An automatic heuristic sorting approach is as follows:

  • We maintain a sorted list of curves, from highest to lowest. The sorted list is initialized to empty.
  • At each iteration, we find the curve that goes the furthest out on the x-axis, but is not yet in the sorted list. We then will choose where to insert it into the sorted list.
    • For this curve and all curves in the sorted list, we want an estimate of the curve value at the current curve’s furthest x-value. We compute this estimate using a moving average. (For this reason, all curves should have aligned x-axis steps, and should have equidistant x-axis steps.)
    • We place this curve into the sorted list, to minimize the number of rank errors of curve estimates at this x-value.

And that’s it!
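
The steps above can be sketched in a few lines of Python. This is a simplified illustration, not the sort-curves.py script itself: it assumes each curve is a non-empty list of y-values at aligned, equidistant x steps (so list length stands in for how far the curve goes on the x-axis), and the names estimate_at, sort_curves and the window size of 3 are hypothetical.

```python
def estimate_at(ys, last_index, window):
    """Moving-average estimate of a curve's value at position last_index."""
    end = min(last_index, len(ys) - 1) + 1
    start = max(0, end - window)
    return sum(ys[start:end]) / (end - start)


def sort_curves(curves, window=3):
    """Return curve names ordered from visually highest to lowest."""
    # Process curves from the longest to the shortest.
    by_length = sorted(curves, key=lambda name: len(curves[name]), reverse=True)
    ordered = []
    for name in by_length:
        last = len(curves[name]) - 1
        value = estimate_at(curves[name], last, window)
        others = [estimate_at(curves[o], last, window) for o in ordered]
        best_pos, best_err = 0, None
        for pos in range(len(ordered) + 1):
            # Rank errors: curves placed above that are lower, plus curves
            # placed below that are higher, at this x-value.
            err = (sum(1 for e in others[:pos] if e < value)
                   + sum(1 for e in others[pos:] if e > value))
            if best_err is None or err < best_err:
                best_pos, best_err = pos, err
        ordered.insert(best_pos, name)
    return ordered


curves = {
    "low": [1, 1, 1, 1],
    "high": [5, 5, 5, 5],
    "short_mid": [3, 3],
}
print(sort_curves(curves))  # ['high', 'short_mid', 'low']
```

Ties between insertion positions are broken in favor of the higher position, which is an arbitrary choice in this sketch.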

Example output

Here is the sorted output of a larger, more difficult example, sorted using the above heuristic. Click on this image to get a larger version you can inspect:

A few of the decisions aren’t good. For example, why is curve 15 placed above curve 6? But most of the decisions are reasonable. For example, curve 13 is placed at the bottom, because it is very low compared to the other curves for the short duration that curve 13 is present.

Code

I have written a script implementing the heuristic above.

Here is the latest version of sort-curves.py.
You will also need movingaverage.py from my Python common library.

USAGE:

./sort-curves.py *.dat

where every *.dat is in standard (gnuplot) two-column-per-line format:

xvalue yvalue

Overall, I find this script a useful timesaver.

March 22nd, 2009

All standard YMMV disclaimers apply.

Update (20090324-2): According to John Millikin, the author of jsonlib, cjson is buggy and unmaintained. I will evaluate further and post a followup blog entry. My discussion with Dan Pascu, the author of cjson, corroborates these claims. I urge readers to read John Millikin’s comment.

Summary:

For quickly deserializing data in Python, use cjson.
simplejson is mysteriously slow on certain installations.

Update (20090324): According to Extra Cheese, cjson 1.0.5 has an incompatibility with simplejson in processing slashes. A fix is available from Matt Billenstein. However, Dan Pascu, the author of cjson, deprecates Matt Billenstein’s cjson 1.0.6 because Matt’s patch parses the JSON twice, which makes it twice as slow. This will still be faster than all alternatives in certain circumstances. You will not find Matt’s cjson on the cheeseshop, only on Matt’s site.

Abstract:

We were initially using simplejson for our work, because the JSON format is human-readable and because anecdotal evidence from the blogosphere touted simplejson’s new C speedups. We observed that simplejson was actually quite slow on one of our installation environments. This observation prompted us to do this study. We found that cjson consistently achieves the fastest deserialization performance. We still do not understand why simplejson is slow in certain installation environments.

Approach:

We compared the following serialization approaches:

  • simplejson 2.0.9, with C speedups
  • jsonlib 1.3.10
  • cjson 1.0.5
  • PyYAML 3.05 with libyaml 0.1.1/0.1.2 C bindings. (We used 0.1.1 on dormeur and 0.1.2 on mammouth.)
  • PySyck 0.61.2 with syck 0.55 C bindings. Note that PySyck did not compile until we followed the advice in this ticket.
  • Google protobuf 2.0.3
  • Python pickle, protocol=-1 (binary)
  • Python pickle, protocol=0 (text)

We have not tried the following serialization approaches:

  • Python marshal, which is supposedly much faster than Python pickle. On the downside, the marshal format may change between Python versions.
  • Native Python, i.e. reading the repr() of the data as a module
  • XML implementations
  • Facebook thrift
  • Hand-coding C serialization

Experiments:

Data:

We were working with a data structure we call the “vocabulary”. The vocabulary is a list of vocabulary terms. Each vocabulary term in turn contained a list of term forms. An example vocabulary term is as follows:

{
    "term class": "the propos delet",
    "canonical form": "the proposed deletion",
    "rank": 3590,
    "count": 7180.0,
    "term forms": [
        { "form": "the proposed deletion", "count": 7153.333333333333 },
        { "form": "the proposed deletions", "count": 13.666666666666666 },
        { "form": "The proposed deletion", "count": 12.0 },
        { "form": "the proposed deletes", "count": 1.0 }
    ]
}

We perform all our deserialization experiments on a vocabulary file that contained 502K fields, as computed using:

zcat vocabulary.json.gz | grep ':' | wc -l

We use gzip on all serialized files, both when writing them and when reading them. The size of the vocabulary in different serialization formats was as follows:

Format                gzip’ed size
protobuf              1.7 MB
JSON                  1.9 MB
pickle (protocol -1)  4.0 MB
pickle (protocol 0)   4.3 MB

gzip’ed JSON uses only about 10% more disk space than the gzip’ed protobuf format, which is the most compact serialization format we tested. JSON has the advantage of being human-readable, unlike protocol buffers.
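
A size comparison of this kind is easy to repeat on your own data. Here is a sketch using only the Python standard library; the function name gzipped_sizes is hypothetical, the standard json module stands in for the JSON libraries above, and protobuf is left out because it needs a schema.

```python
import gzip
import json
import pickle


def gzipped_sizes(obj):
    """Return the gzip'ed size in bytes of obj under several serializations."""
    encodings = {
        "JSON": json.dumps(obj).encode("utf-8"),
        "pickle (protocol -1)": pickle.dumps(obj, protocol=-1),
        "pickle (protocol 0)": pickle.dumps(obj, protocol=0),
    }
    return {name: len(gzip.compress(data)) for name, data in encodings.items()}


term = {
    "term class": "the propos delet",
    "canonical form": "the proposed deletion",
    "rank": 3590,
    "count": 7180.0,
    "term forms": [
        {"form": "the proposed deletion", "count": 7153.333333333333},
        {"form": "the proposed deletes", "count": 1.0},
    ],
}

for name, size in gzipped_sizes([term] * 1000).items():
    print(name, size)
```

Repeating one term a thousand times compresses unrealistically well, so run this on real data before drawing conclusions.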

Setup:

We tested on two different eight-core x86-64 Linux installation environments.

Name      Python version  CPU model name                              OS version
dormeur   2.5             Intel(R) Core(TM)2 Duo CPU E8400 @ 3.00GHz  2.6.23.17-88.fc7
mammouth  2.6.1           Intel(R) Xeon(R) CPU E5462 @ 2.80GHz        2.6.18-92.1.10.el5_lustre.1.6.6smp

Results:

We read in the vocabulary using a particular deserialization approach. We measure real time, as well as the combined user time and system time, using the Unix ‘time’ command. For each experiment, we ran the deserialization of the vocabulary three times, and averaged the times over these three runs. Variance appeared to be low, but we did not compute it. We present all times in seconds. Some experiments were not performed on mammouth.

The first result line in the table, ‘read’, is when we read the vocabulary json.gz file into memory, but do not deserialize it. It provides a lower bound on deserialization time, and hence an upper bound on the performance of any deserializer.
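
For readers who want to run a similar measurement, here is a sketch that times things in-process rather than with the Unix ‘time’ command, so it captures only wall-clock time and not user+sys. The names average_seconds and make_benchmarks are hypothetical, and the standard json module stands in for whichever deserializer you want to test.

```python
import gzip
import json
import time


def average_seconds(func, runs=3):
    """Call func `runs` times and return the mean wall-clock time in seconds."""
    total = 0.0
    for _ in range(runs):
        start = time.perf_counter()
        func()
        total += time.perf_counter() - start
    return total / runs


def make_benchmarks(path):
    """Build the 'read' baseline and a JSON deserializer for a .json.gz file."""
    def read_only():
        with gzip.open(path, "rb") as f:
            return f.read()

    def read_and_decode():
        with gzip.open(path, "rb") as f:
            return json.loads(f.read().decode("utf-8"))

    return {"read": read_only, "json": read_and_decode}
```

Calling average_seconds(make_benchmarks("vocabulary.json.gz")["json"]) gives a number comparable in spirit to the real-time column, with the "read" entry playing the role of the baseline row.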

The following table presents the results, sorted by real time on dormeur.

deserializer           dormeur           mammouth
                       real   user+sys   real   user+sys
read                   0.76   0.24       0.18   0.18
cjson                  2.17   1.04       0.93   0.91
jsonlib                7.88   6.59       3.77   3.77
cPickle (protocol -1)  13.3   9.9        10.2   10.2
PySyck                 19.1   18.2       -      -
simplejson             24.7   16.2       1.10   1.04
cPickle (protocol 0)   25.1   20.4       20.7   20.7
protobuf               42.3   32.4       -      -
PyYAML                 89.3   80.5       319    318

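Observe that simplejson is more than an order of magnitude slower than cjson on dormeur (24.7s versus 2.17s real time), whereas on mammouth the two are comparable (1.10s versus 0.93s).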

Conclusions:

gzip’ed JSON uses only about 10% more disk space than the most compact serialization format we tested (gzip’ed protocol buffers). JSON has the advantage of being human-readable, unlike protocol buffers.

cjson has the fastest deserialization time of all packages we tested. We have not measured serialization time in the experiments above, but we do so in the next section.

We did not realize that simplejson was far slower on one of our installs until we did speed tests. simplejson should be avoided unless you specifically determine that it is comparable in speed to cjson. On certain installs, simplejson deserialization is as fast as cjson. On other installs, simplejson deserialization is an order of magnitude slower than cjson. On “slow” installs, the user is led to believe that C speedups have been compiled into simplejson. Indeed, evidence indicates that our “slow” simplejson installation was, nonetheless, using C speedups:

>>> simplejson.decoder.make_scanner
<type 'simplejson._speedups.Scanner'>
>>> simplejson.decoder.scanstring is simplejson.decoder.c_scanstring
True

The user might not detect that simplejson is slow without a direct speed comparison to cjson.

protobuf is interesting because it requires one to declare the protocol schema. This is useful for documenting your data format. Unfortunately, the Python implementation of Google’s Protocol Buffers is very slow because it is pure Python.

Generating C++ Protocol Buffers and wrapping them with swig, as suggested by this commentator, might be faster than cjson. Hand-coding C serialization routines is another option if one must eke out every last bit of speed.

Related work:

This study and this followup provide supporting evidence that cjson is faster than alternatives. Neither of these studies experienced any simplejson slowness.

We used bouncybouncy’s sertest2 code, and modified it to use CDumper and CLoader (the C libyaml bindings) in PyYAML. We also modified their code to create 100K records.

Here is the output of sertest2 running on dormeur, which we have modified slightly for improved readability:

100000 total records        (0.830s)

get_thrift                  (0.300s)
get_protobuf                (5.010s)

Serialize:
ser_cjson                   (0.270s) 6807019 bytes
ser_simplejson              (2.210s) 6807019 bytes
ser_yaml                    (31.590s) 6107019 bytes
ser_protobuf                (19.760s) 1716519 bytes

Serialize to a gzip'ed file:
ser_cjson_compressed        (0.520s) 1245257 bytes
ser_simplejson_compressed   (2.440s) 1245257 bytes
ser_protobuf_compressed     (19.920s) 980508 bytes
ser_yaml_compressed         (31.610s) 1205509 bytes

Deserialize:
serde_cjson                 (0.510s)
serde_simplejson            (12.370s)
serde_protobuf              (36.740s)
serde_yaml                  [slow, got tired of waiting for it]

bouncybouncy’s related study also compares with thrift, which we do not use. bouncybouncy finds that thrift is faster than protobuf but slower than cjson. When we installed thrift (SVN revision 757299) on dormeur, sertest2 thrift routines crashed with the following traceback:

Traceback (most recent call last):
  File "./test_speed.py", line 169, in <module>
    print 'serde_thrift        (%0.3fs)' % t(serde_thrift)[0]
  File "./test_speed.py", line 138, in t
    ret = f()
  File "./test_speed.py", line 108, in serde_thrift
    s = _ser_thrift()
  File "./test_speed.py", line 73, in _ser_thrift
    return thrift_to_bytes(ret)
  File "./test_speed.py", line 59, in thrift_to_bytes
    var.write(protocolOut)
  File "gen-py/passivedns/ttypes.py", line 146, in write
    iter6.write(oprot)
AttributeError: 'str' object has no attribute 'write'

The results presented in this section, as well as the results of the related studies, match the relative performance of these libraries on mammouth in our earlier experiments.
