Diagnosing and Improving Textbooks with Data Mining

On May 1 I heard a stimulating lecture by Rakesh Agrawal of Microsoft Research, in which he described work to diagnose the conceptual coherence and comprehensibility of textbooks and then automatically select additional web resources to augment the text at the problematic locations. The key measure of coherence is called "dispersion," and it captures the intuitive idea that a section of text that discusses too many unrelated or weakly related concepts will be hard to understand. The method extracts noun phrases (ignoring the very frequent ones) and then searches Wikipedia to build a graph of the connections among the phrases. For example, if a section mentions "metadata," "Dublin Core," and "gasoline," we'll probably find links between the first two but none involving gasoline. The dispersion measure is then combined with a readability measure based on average word and sentence lengths, and sections with high dispersion and low readability become candidates for augmentation.
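To make the dispersion idea concrete, here is a toy sketch using the section's own example. The pairwise-unlinked proxy, the `LINKS` table, and the `linked` predicate are my assumptions for illustration, not Agrawal's actual formulation; in the real system the links would come from searching Wikipedia.

```python
from itertools import combinations

# Toy link table standing in for Wikipedia's link graph: the pages for
# "metadata" and "Dublin Core" link to each other; "gasoline" links to neither.
LINKS = {frozenset(("metadata", "Dublin Core"))}

def linked(a, b):
    """Assumed predicate: do the Wikipedia pages for a and b link up?"""
    return frozenset((a, b)) in LINKS

def dispersion(phrases):
    """Fraction of concept pairs in a section with no link between them.

    High dispersion means many weakly related concepts, which (per the
    intuition above) makes the section harder to understand.
    """
    pairs = list(combinations(sorted(set(phrases)), 2))
    if not pairs:
        return 0.0
    unlinked = sum(1 for a, b in pairs if not linked(a, b))
    return unlinked / len(pairs)

# Two of the three concept pairs are unlinked, so dispersion is 2/3.
print(dispersion(["metadata", "Dublin Core", "gasoline"]))
```

A section scoring high on this measure, and low on a length-based readability score, would then be flagged as a candidate for augmentation.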

I think that TDO is very tightly written and highly readable, but we nonetheless want it to be as good as it can be. In particular, as we create enhanced ebook versions we are asking two questions: what kinds of enhancements, like photos or interactivity, add value, and where best to incorporate them.

So I'm making a field trip down from Berkeley to Microsoft Research in Mountain View soon, and I will let you know whether we can apply Agrawal's work to TDO.

The home page for the research project is

-bob glushko
