Thursday, 5 May 2011

Matrix Market to mysql and back again

[Update: Radim Rehurek, gensim's author, pointed out a way to achieve this with gensim indexes - see this thread for details.]

Here are some perl scripts for converting a sparse matrix from matrix market (.mm) format to mysql (or another database) and back again.

My purpose in creating these is to use subsets of a very large matrix with the gensim vector space modelling toolkit. For example, a matrix representing a bag-of-words model of 3.3 million or so English Wikipedia articles with a vocabulary of 100,000 words is rather large, around 7.4G in its matrix market (mm) file format.

To perform operations on a subset of the matrix (in my application, similarity queries on a small set of documents), it's useful to be able to quickly extract a given set of rows from the larger matrix, without reading the entire 7.4G file each time.

Thus, the scripts allow converting the large mm file into a mysql database, which can be queried efficiently to return a specific set of rows that can be converted back into a much smaller mm file that gensim can load for use with memory-bound operations such as MatrixSimilarity.

Dependencies
  • perl with CPAN modules DBI and DBD::mysql
  • mysql
Importing a matrix into mysql

mm2sql.pl reads a .mm file, and outputs a set of SQL statements to import the matrix into a mysql database. To create the schema which consists of the matrix_info and matrix tables and indexes, first create a database (for examples here called gensim), and then run:
pod2text mm2sql.pl | mysql -u root -p gensim
To import a matrix into the mysql database, run:
mm2sql.pl matrixname.mm | mysql -u root -p gensim
If you want to import more than one matrix into the same database, then set the matrixid value for each by editing the mm2sql.pl file before running the import command. The default matrixid is 1.

Exporting rows from a matrix

Use db2mm.sql for the reverse operation. First edit the script to set your local mysql connection info (hostname, database name, username and password). Then give the matrix id and row numbers on the command-line. For example to export rows 7, 9 and 17 from matrix id 1:
db2mm.sql 1 7 9 17 > newmatrix.mm
Note that the export script will renumber the rows in the matrix, so row 7 becomes row 1, row 9 becomes row 2, and row 17 becomes row 3. To preserve the original row numbering (and produce a very sparse matrix with many empty rows), it is fairly straightforward to edit the script to change this behaviour.

No comments:

Post a Comment