dx

Indexer and topic modeller for the American Mathematical Society catalogue.

A project spun out of dex, this was initially intended to cover the GSM (Graduate Studies in Mathematics) series, one of my favourite mathematical book series. I had often wished for an index to it, both to find the titles I already have and to spot others I may wish to buy.

However, the scraping ended up being so successful that I expanded it to cover several other series, reaching around 2,000 titles in total.

The output is a processed dataset of titles, abstracts, 'intended readership' notes, book reviews, and tables of contents, along with exploratory topic modelling using LDA.
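As a rough illustration of what one entry in such a dataset might look like, here is a minimal sketch. The class name `BookRecord`, its field names, and the `text()` helper are hypothetical and chosen only for this example; they are not necessarily what dx uses internally.

```python
from dataclasses import dataclass, field


@dataclass
class BookRecord:
    """One catalogue entry (hypothetical schema, for illustration only)."""

    title: str
    abstract: str = ""
    readership: str = ""
    reviews: list[str] = field(default_factory=list)
    toc: list[str] = field(default_factory=list)

    def text(self) -> str:
        """Join all non-empty fields into one document for a topic model."""
        parts = [self.title, self.abstract, self.readership, *self.reviews, *self.toc]
        return "\n".join(p for p in parts if p)
```

Flattening each record to a single string like this is one simple way to give a topic model a "document" per title; weighting the fields differently would be an equally reasonable choice.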

Though 2,000 titles sounds like a lot for a person to consider, to a topic model it's still fairly small, and the model was limited by the dataset size. I'd seen references to 600 documents being the minimum viable size of a newsgroup dataset for LDA, while shorter documents (e.g. tweets) would need on the order of 5,000 to 10,000.

This was primarily of exploratory value (using LDA and comparing the tradeoffs in topic model quality between its different variants), with the aim of seeing whether particular subfields of mathematics could be indicated by particular over-represented words.
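To make the idea of "over-represented" words concrete, here is a minimal standard-library sketch. It is not LDA and not the method used in this project: it simply ranks words by a smoothed log-ratio of their frequency in a subset of documents against their frequency in the whole corpus. The function name `over_represented` and its parameters are assumptions made for this example.

```python
import math
from collections import Counter


def over_represented(subset_docs, all_docs, top_n=5, min_count=2):
    """Rank words by how much more frequent they are in subset_docs than in all_docs.

    Uses add-one smoothing and a log-ratio of relative frequencies.
    Illustrative only: this is a plain frequency comparison, not a topic model.
    """
    subset = Counter(w for doc in subset_docs for w in doc.lower().split())
    corpus = Counter(w for doc in all_docs for w in doc.lower().split())
    n_subset = sum(subset.values())
    n_corpus = sum(corpus.values())
    vocab_size = len(corpus)

    scores = {}
    for word, count in subset.items():
        if count < min_count:
            continue  # skip rare words, whose ratios are too noisy
        p_subset = (count + 1) / (n_subset + vocab_size)
        p_corpus = (corpus[word] + 1) / (n_corpus + vocab_size)
        scores[word] = math.log(p_subset / p_corpus)

    # Highest log-ratio first; ties keep first-seen order
    return sorted(scores, key=scores.get, reverse=True)[:top_n]
```

Called with, say, the abstracts from one series as `subset_docs` and every abstract as `all_docs`, this returns the words most characteristic of that series. A topic model goes further by discovering the groupings itself rather than being handed them.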