Use flag –xml when you run mysqldump

October 14th, 2009 | Tags: , , , ,

Summary:

If you have text data (like a web scrape) stored in a MySQL database, and you want to share the data, mysqldump to XML using the --xml flag.

When fields are unlikely to contain tabs, an even simpler format is a tab-separated file, created using the --tab=path flag to mysqldump. path must be owned by the MySQL database user.

The Problem with the standard MySQL dump format

The standard MySQL dump looks as follows

INSERT INTO `sources` VALUES (1,'2009-03-07 22:06:36','"You\'ve got to be kidding me"', ...

The problem is that the standard dump format is difficult to interact with programmatically.

It is difficult to parse using regular expressions because you cannot merely search for single quotes. You have to search for single quotes that are not preceded by a backslash (unless, perhaps, that backslash is preceded by a backslash).

Also, there are no libraries for reading the standard dump format, nor scripts for converting it into a standard format like JSON or XML. I asked the oracle as well as stackoverflow.

So if you receive a MySQL dump in the standard format, you might have to install MySQL and import the dump to get at your data.

The tabbed MySQL dump format

You can create a directory with one file per table, and the table will be one-row-per-line, with tab-separated values:

mysqldump --tab=path database

Here is some example output:

1	2009-03-07 22:06:36	"You've got to be kidding me"

If you get an error of the following form when you issue the mysqldump command:

mysqldump: Got error: 1: Can't create/write to file 'path/database.txt' (Errcode: 13) when executing 'SELECT INTO OUTFILE'

You can resolve this complaint by making sure that /tmp/path is owned by the mysql user (and also writeable by the current Unix user). Thanks JinRong Ye!

This format is convenient if none of your data contains tabs. In NLP, however, it is quite possible that your text will contain tabs.

The XML MySQL dump format

Enter the XML MySQL dump format:

        <table_data name="sources">
        <row>
                <field name="id">1</field>
                <field name="created_at">2009-03-07 22:06:36</field>
                <field name="text">&quot;You've got to be kidding me&quot;</field>

Ah… pure bliss. You can get the XML dump format as follows:

mysqldump --xml database
Reblog this post [with Zemanta]
  • Share/Bookmark
  1. October 14th, 2009 at 18:17
    Reply | Quote | #1

    In response to Joshua Reich:

    Let me answer your last question first:

    > 4. Why aren’t you using postgres ?

    I was getting data from someone that uses MySQL.
    Knowing what I know now, I believe should should have advised him to use the –xml flag.

    > 1. I am friends with awk & pals, and stripping out INSERT .. VALUE from
    > mysql dumps that I get from people is no biggy

    I am friends with perl, and you cannot simply split using /,/ to get your fields. The comma might be right in the middle of a string.

    With XML, though, it is simple to grep for <row>, because you know 100% that < will only be in the markup.

    > 2. I’m pretty sure MySQL supports dumping table data as CSV (SELECT …
    > OUTHOUSE ‘/tmp/file.csv’ …)

    Same problems as above.

    > 3. For big data, XML is just silly big.

    Why? It gzips easily.
    Not being able to load it all into memory is less of an issue if it is easy to split the data using regular expressions.

  2. October 14th, 2009 at 20:47
    Reply | Quote | #2

    XML has its place (somewhere), but in this programmer’s humble opinion, exporting tabular data is not one of them.

    http://www.dataspora.com/blog/xml-and-big-data/

MetaOptimize is Digg proof thanks to caching by WP Super Cache