tubeulator

TfL open data interface library

tubeulator began after the COVID pandemic, when lockdowns curtailed the ability to move around and much of the Transport for London (TfL) network was modified, restricted, or halted altogether (such as the Night Tube, which partially returned in 2021).

I began this project with zero awareness of what the TfL Open Data service offered.

I was fascinated by the Differentiable Neural Computer paper (Graves et al. 2016), a much-hyped paper from the interregnum between the LSTM and Transformer eras (the latter being what we now see in technologies like ChatGPT). The paper used navigation of the London Underground as a task to demonstrate its model's capabilities, with some bold claims.

What fascinated me was the idea that a graph task could be done so simply, but in reality I doubt this would be practical based on the Tube network's connectivity alone. Specifically, a single "shortest path" is not a practical reality because of scheduling. Two routes cannot be ranked in the real world by their purported durations alone: you may need to wait longer at an interchange if you happen to just miss a train, even if the total time you would be aboard the trains is less than on the other route.
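To make the scheduling point concrete, here is a minimal sketch on a made-up four-station network (nothing here comes from TfL data or from the paper). Each edge carries a ride time and a headway, and the sketch assumes the worst case of waiting one full headway whenever you board a new line. The names `NETWORK` and `best_route` are hypothetical, and real waits depend on the timetable rather than a fixed headway.

```python
import heapq

# Hypothetical toy network, not TfL data.
# Each edge is (next_station, ride_minutes, headway_minutes, line).
NETWORK = {
    "A": [("B", 4, 2, "red"), ("C", 3, 10, "blue")],
    "B": [("D", 4, 2, "red")],
    "C": [("D", 3, 10, "green")],
    "D": [],
}


def best_route(network, start, goal, include_waits):
    """Dijkstra over (station, line) states.

    When include_waits is True, boarding a different line costs one full
    headway (an assumed worst case: you just missed the train).
    """
    # "" stands for "not yet on any line", so the first boarding also waits.
    queue = [(0, start, "", [start])]
    seen = set()
    while queue:
        minutes, station, line, path = heapq.heappop(queue)
        if station == goal:
            return minutes, path
        if (station, line) in seen:
            continue
        seen.add((station, line))
        for nxt, ride, headway, nxt_line in network[station]:
            wait = headway if include_waits and nxt_line != line else 0
            heapq.heappush(
                queue, (minutes + ride + wait, nxt, nxt_line, path + [nxt])
            )
    return None


ride_only = best_route(NETWORK, "A", "D", include_waits=False)
with_waits = best_route(NETWORK, "A", "D", include_waits=True)
print(ride_only)   # (6, ['A', 'C', 'D'])
print(with_waits)  # (10, ['A', 'B', 'D'])
```

Counting only time aboard, the route via C wins (6 minutes against 8). Once the assumed worst-case waits are charged, the route via B wins (10 minutes against 26), so the ranking flips without any change to the network's connectivity.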

The first step was to wire up all of the TfL APIs via codegen (which turned into something of an entity resolution odyssey, as the entities had been essentially anonymised, perhaps by mistake, in an update).

I came up with a trick to identify the entities and avoid using the old 'Unified' API, which is marked as deprecated, but which many had continued using to get around this issue with entity names.

In this initial phase of the project I wasn't familiar with Pydantic, but in 2024 I repurposed the 'codegen' (templated code generation) module to export interfaces as Pydantic models instead of dataclass-wizard dataclasses.

The credentials are stored in MongoDB, a document-oriented database (which I'd never used before, and which I have since discovered some people hold uncharitable opinions of), and the plan is for the routes etc. to be stored there too.

At the time of writing [May 2024] this has not been accomplished, as I noticed that some new data had quietly been added via 2 new API endpoints, one of which was Station Data. This meant that the station data no longer had to be treated as a perpetual unknown each time it was pulled, but could be verified as entirely known or not based on the last update of this dataset.

At this point, I introduced Patito: a combination of Polars, the Pandas dataframe competitor written in Rust for speed and memory efficiency, and Pydantic, a library which provides runtime type-validated dataclasses it calls "data models". This let me consume datasets using type-validated data models, and turned out nicely, effectively moving the need to handle data types out of the code and into these data model schemas. I'm a big fan of this style of programming with a clean separation of state (the data types and fields loaded from source data) and behaviour (genuine mutation of values).

Requiring this data model to be explicitly drawn out upfront

This groundwork then let me make some nice plots of the "station points" (entrances, info desks, nearby bus stops) for each of the TfL stations.

Unsurprisingly, upon reflection, the routes between stations were not part of the "Station Data". The relations of stations to their platforms, and of platforms to the "services" (Tube lines etc.), were included, but these did not in themselves give connectivity (only unordered sets).
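A small sketch of why unordered sets fall short, using a made-up three-station line (the station names and the helper functions are hypothetical, and real lines can also branch, which this ignores). The same set of stations is consistent with several different orderings, and each ordering implies a different set of adjacent pairs.

```python
from itertools import permutations


def possible_orderings(stations):
    """All linear orderings of a station set, counting each direction once."""
    return sorted({min(p, p[::-1]) for p in permutations(sorted(stations))})


def adjacent_pairs(ordering):
    """The undirected station-to-station links implied by one ordering."""
    return {frozenset(pair) for pair in zip(ordering, ordering[1:])}


# Hypothetical line known only as an unordered set of stations.
toy_line = frozenset({"P", "Q", "R"})

orderings = possible_orderings(toy_line)
for ordering in orderings:
    links = sorted("-".join(sorted(link)) for link in adjacent_pairs(ordering))
    print(ordering, links)
```

With three stations there are already three candidate orderings, each with a different pair of links, so the set alone cannot say which stations are actually neighbours. That ordering has to come from route data.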

The next step will be to load these routes from the APIs into the document-oriented database. For this I plan to migrate from MongoDB to TinyDB (which doesn't require an external server), but may end up supporting both.

I'd also like to rewrite more of the code with Pydantic now that it's been introduced into the library, as it has a tendency to simplify code flow and make for a nicer developer experience.

On a final and more serious note, I'd also like to generate a recording of the official Tube robo-voice saying:

It's time for the tubeulator!