This page last changed on Feb 23, 2008 by martinmueller@northwestern.edu.

This is a report on our experience at NU of scaling up to 40 million words in NCF/Stein in WordHoard, using the MySQL database product, along with some thoughts based on that experience on the big problem, which is scaling up to on the order of a billion words in Monk.

While this note is long and unavoidably technical, it actually oversimplifies many of the details and omits many important technical considerations, but we hope without sacrificing anything essential to a proper understanding of the problems we face.

The following remarks begin with observations that are specific to WordHoard, but apply to Monk in the sense that if the Monk datastore takes a similar approach, or supports similar functionality, the scale problems will be of a similar nature. The details here are most relevant if we decide to use MySQL for the Monk datastore. The general discussion, however, if not all the details, remains relevant if we decide to use some other relational database product for the datastore, and/or some other approach like triples or Nora-DB.

It takes about 8 hours to build the 18 gigabyte WordHoard MySQL database for the 40 million words in NCF/Stein, running on our new "Scribe" development host. Scribe is a dual-core dual-CPU AMD Opteron 2214 with 8 GB RAM running Red Hat Linux. We are using MySQL version 5.0.22 with the tuning parameters set to the values recommended for "huge" databases. The build process starts with MorphAdorner output files, runs them through a new program named "ConvertMorph" to translate them into WordHoard input files, then runs those files through the existing largely unmodified WordHoard build process to construct the MySQL database. This work represents scaling up by a factor of about 22, from 1.8 million words in our previous WordHoard database to the 40 million words in our new NCF/Stein database. 

As expected, the bottleneck in this process is importing the several large tables into the MySQL database and the construction by MySQL of the many indices defined on those tables. These steps of the build process do not scale linearly. They have complexity n times the logarithm of n, not just n, where n is the number of words. We see evidence of this in the scaling work we've done. The 40 million word database takes considerably more than 22 times as long to build as does the 1.8 million word database - it actually takes about 30 times as long. For 1.8 million words, the MySQL table import and index construction steps take 36% of the build time, while with 40 million words they take 60% of the build time.

To build a billion words, we would need to scale up by another factor of 25. A rough estimate is weeks of time to build such a database, which would be on the order of half a terabyte in size. That's weeks - not hours or days! This is clearly unmanageable. While it's quite feasible that we could squeeze out some more performance in the current build process by further experimentation with MySQL tuning parameters, etc., we should expect any such gains to be marginal, not fundamental. It is a reasonable conclusion to reach at this point that unlike what we've done to scale from 1.8 to 40 million words, solving the problem of scaling to a billion words requires an architecture change, not more relatively simple tweaking and tuning around the edges of the problem.

To make it even worse, importing MySQL tables and building their indices is an operation that is in theory but is not at all in practice easily split into many small pieces that could be done in parallel, and at the scale of a billion words the table importing and index building would require most of the time. Thus, trying to attack the problem by using multiple processors running in parallel wouldn't help much.

The final piece of bad news is that while incremental updates are certainly possible in theory with a single billion word database, in practice they do not work at all well at this scale. One might wish, for example, that if we had a billion word database, adding or replacing a single 40 million word corpus would take just 8 hours. That's not the case. We suspect that this operation would also take weeks. That's only a guess, but it's a rather educated one based on past experiences with MySQL incremental updates involving adding or changing a large number of rows in big tables with large numbers of indices. The MySQL manual supports this hypothesis, and recommends essentially doing full rebuilds of the indices in these kinds of situations, to avoid excessive disk seeks.

What can we do? Divide and conquer is the obvious strategy. We would like to use multiple independently constructed databases, not just one huge monolithic one. The problem is that our mission is to make this transparent to our users, who must be able to do efficient queries and run analytic procedures in a reasonable amount of time across arbitrary subsets of the entire collection, without any constraints on the complexity of the questions they are permitted to ask or the kinds of problems they ask us to solve. Another way to state this is that the goal is to impose no restrictions of any kind on how users are permitted to join together data from one place with data from another place. The term "join" here is deliberately used with two meanings - both the plain vague English meaning and the precise technical meaning in the context of relational database queries that need to perform "join" operations on data tables.

A second problem is that we have an existing code base of several hundred thousand lines of code that took several man-years to develop, and of course we would like a solution that did not require throwing away that code base and starting over from scratch. This is a secondary goal, however, not the most important one. If we can pull this off, that would be fine, but if we can't, so be it.

MySQL has a "merge storage engine" which looks promising at first glance as a potential solution to both of these problems. In MySQL, a "merge table" combines multiple other tables with identical structure into one big virtual table.

With merge tables, we could build smaller parts of the database independently. There are many ways to do this, but for the sake of simplifying the discussion and using a concrete example, let's assume for now that we build one component database per corpus, and that we have a total of a billion words divided into 25 corpora of size 40 million words each. Given 25 processors/hosts/disk-drives working in parallel, we could build a billion words in the same amount of time it takes to build 40 million words, about 8 hours. When all the parts have been built, resulting in 25 separate 40 million word databases, merge tables would be used to virtually combine them into what appears to be and is functionally equivalent to a single very large database of a billion words, with no constraints on the kind of queries that can be issued against the big database, and with no need to radically modify or rewrite existing code that issues those queries. This is exactly what we need to solve both of our problems!

Incremental updates become quite simple with this approach. To add a new 40 million word corpus or to rebuild an existing 40 million word corpus, we would just build it and move the resulting tables into the existing billion word database. This would take about 8 hours in either case - the time required to build or rebuild the 40 million words. We would not need to reindex or rebuild any of the other corpora.

This approach could also help address our intellectual property problems. The individual tables for each corpus could be restricted so that access is permitted only from hosts at those institutions which have licensed the corpus. A merge table would permit access only from hosts at those institutions which have licensed all of the underlying corpora merged by the table. This is quite flexible, and multiple merge tables could be constructed to represent various subsets or "IP views" of the database for this purpose.

Alas, as with all free lunches, this is too good to be true. A Monk datastore built in this way would almost certainly result in unacceptable performance. The MySQL merge storage engine has no knowledge of how the data is distributed over the component tables, so it is very stupid. When a lookup is requested using a merge table, all of the underlying component tables are queried for the result. In our situation, where we're talking about merging together as many as 25 separate underlying tables, or perhaps even more, this is unacceptable. In the overwhelming majority of cases, we know in advance that the data we are looking for resides in only one of the underlying tables, or perhaps in at most a few of them, but MySQL has no idea that this is the case or how to exploit it to get reasonable performance.

One way to fix this performance problem might be to modify the MySQL merge storage engine to add Monk-specific intelligence. For example, suppose the engine receives a request to retrieve the lemma for a particular word occurrence, or to find all the words in some particular work with some particular lemma, or perhaps to extract a sparse matrix of frequency counts for all the lemmas in all the works in some corpus. Because of the way we have constructed the database and the merge tables, we know in advance in each of these three examples that all of the data that we need to find and gather together reside in the underlying table for a single corpus. There is no need to consult all the other underlying tables for all the other corpora in these examples. An appropriately modified engine would have this extra knowledge about the structure of our particular database, and it would quickly look up the requested information in just the right places, ignoring the places where it knows it cannot possibly find the information it needs. The theory is that with enough intelligence about how the data is structured, for the great majority of operations, this would result in performance which is not only acceptable, but is, for all practical purposes, as good as what we could get from a single monolithic database without merge tables. If this idea worked, it would result in good performance, while still offering full support for relational queries involving arbitrarily complex table joins across corpora.

We have not yet looked at the source code for MySQL's merge storage engine, so we don't know exactly how much work this would be, or even if this idea is feasible. There may well be a fatal flaw in this idea that we haven't yet realized. If it could be made to work, it would not be a trivial project, that much seems certain. It would be at least several months of work.

Another even more ambitious idea is that it might be possible to develop such a custom merge storage engine that is also capable of doing the lookups over distributed databases, where the underlying databases and their tables could reside on multiple hosts on the Internet rather than all on a single host. This would represent a kind of combination of the MySQL merge and federated storage engines, customized and optimized for Monk. Again, we do not yet know how much work this might be, but it's an obvious and an attractive idea.

These are just ideas at this point, not proposals by any means. They may or may not turn out to be worth investigating further. We have no significant experience with these kinds of very large and complex databases in our team at NU, so any help we could get in thinking about these problems and their potential solutions would be much appreciated. For example, is anyone aware of other ideas that might make MySQL do what we need it to do at this scale? How about other database products like Oracle? Do they offer better solutions to these problems than MySQL?

We are also interested in finding out what similar issues arise in alternative Monk datastore strategies like triples and Nora-DB and how they might be dealt with. The big questions are the same ones we have talked about here in the context of MySQL:

  1. What is the query potential of the strategy? Does it permit arbitrarily complex queries over all of the available data, or does it impose constraints on the kinds of questions users can ask and the kinds of analyses they can do?
  2. How much time does it take to build a datastore? For 40 million words? For a billion words?
  3. What if any opportunities exist for exploiting multiple processors running in parallel in the build process?
  4. How would incremental updates work?
  5. What opportunities are there for divide and conquer strategies? Are there tradeoffs involving capabilities and performance?
  6. How does the strategy deal with intellectual property issues?
  7. Could the strategy be made to support distributed datastores?

We certainly don't have full answers to all these questions for MySQL, but we've started to think about the problems.

Document generated by Confluence on Apr 19, 2009 15:04