An audio transcriber for web radio (pulling from BBC Sounds)

tap is a software pipeline I pieced together in early 2021, at a time when it had become both difficult and taxing to follow along with the news cycle. The pipeline took recordings of BBC radio broadcasts via beeb (among other sources) and proceeded in 3 core steps before exporting to a website (which I could check for a sign that the pipeline had finished):

1) VAD (Voice Activity Detection)

i.e. identifying where someone is speaking based on the 'prosodic feature' of intensity
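tap's actual VAD code isn't reproduced here, but the idea of thresholding on intensity can be sketched as follows (the frame length and threshold values are illustrative, not the ones the pipeline used):

```python
from typing import List, Tuple

def detect_voiced_regions(
    samples: List[float],
    sample_rate: int = 16_000,
    frame_ms: int = 30,
    threshold: float = 0.01,
) -> List[Tuple[float, float]]:
    """Return (start, end) times in seconds where frame intensity exceeds a threshold."""
    frame_len = sample_rate * frame_ms // 1000
    regions = []
    start = None
    for i in range(0, len(samples), frame_len):
        frame = samples[i : i + frame_len]
        energy = sum(s * s for s in frame) / len(frame)  # mean-square intensity
        t = i / sample_rate
        if energy >= threshold:
            if start is None:
                start = t  # voiced region opens
        elif start is not None:
            regions.append((start, t))  # voiced region closes
            start = None
    if start is not None:
        regions.append((start, len(samples) / sample_rate))
    return regions
```

Only the regions this returns need to be sent on to the (much more expensive) transcription step.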

2) Transcription

For transcription I at first hoped to use a small model, but transcription only reached a reasonable standard (i.e. became legible, with anywhere near useful accuracy) with the largest model, which at the time was wav2vec2-large-960h-lv60-self.

A sample from October 15th, with corrections to the major mistranscriptions added in brackets:

Good Morning It 6 O'Clock on Friday, the fifteenth of October. This is to day with Martha Carney and Nick Robinson, the headlines. This morning overseas H. G. V drivers are to be allowed to make more deliveries in the U. Ca. ["UK"] to tackle problems in the supply chain. We'll ask the Transport Secretary why the governments changed its mind about pulling the Immigration Leader to en staff shortages. Some people which've had cover ["COVID"] test de site in Barkshire have been advised to get retested after it emerged. Some negative results were wrong, could other parts of the country also be effected. Also to day, the legal time limit that means thousands of dollars.. We've had to keep all the window shut because it's horrible, really horrible. All the streets, OIS covering bents, "I've got four or five been backs, satin at home, watin to be taken, but nothing to do with hem you know, and can't." An Na Soaa s atadel ["And Adele"] releases her first track in six years will the critics go easy on her. The babysee ["BBC"] news is read by Allen Smith: "The rules on the number of deliveries which overseas H. G. V drivers can make in the W. K ["UK"] are to be relaxed to try to ease the pressure on supply chains." The government hopes the move will help to prevent shortages in the run-up to Christmas.
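Incidentally, errors like "U. Ca." and "W. K" for "UK" are characteristic of a character-level CTC model decoded greedily, with no language model to favour known spellings: the model emits one symbol per audio frame, and decoding simply collapses repeats and drops blanks. A minimal sketch of that greedy decode (the vocabulary and frame ids here are made up for illustration):

```python
def ctc_greedy_decode(frame_ids: list, vocab: dict, blank_id: int = 0) -> str:
    """Collapse repeated ids, drop CTC blanks, map the rest through the vocabulary."""
    out = []
    prev = None
    for i in frame_ids:
        if i != prev and i != blank_id:
            out.append(vocab[i])
        prev = i
    return "".join(out)

# Hypothetical per-frame argmax ids for the audio "UK" (0 is the CTC blank):
vocab = {1: "U", 2: "K", 3: " "}
frames = [0, 1, 1, 0, 0, 2, 2, 2, 0]
print(ctc_greedy_decode(frames, vocab))  # "UK"
```

When the per-frame argmaxes wobble, the decode happily emits spellings no dictionary contains, which is exactly what the sample above shows.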

As can be seen above, many of the major mistakes come from proper nouns, which could presumably be addressed by fine-tuning (examples here and here). Doing so would require building a custom dataset from pairs of target text (sentence) and audio files (path).

Again, the procedure to do this looks quite involved. Thankfully there seems to be a less fiddly way to fine-tune Wav2Vec2 using Lightning Flash (a task-oriented PyTorch framework built on PyTorch Lightning, docs).
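Whichever route, the dataset itself is just a manifest of (audio path, target sentence) pairs. A sketch of writing one with the standard library (the column names, file paths and transcripts are illustrative, not necessarily what Flash expects):

```python
import csv

# Hypothetical clip/transcript pairs; a real dataset would pair VAD-cut audio
# segments with manually corrected target text.
pairs = [
    ("clips/2021-10-15_0600_01.wav", "the bbc news is read by alan smith"),
    ("clips/2021-10-15_0600_02.wav", "adele releases her first track in six years"),
]

def write_manifest(pairs, out_path: str) -> None:
    """Write (audio path, target sentence) pairs as a CSV manifest."""
    with open(out_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["file", "text"])  # illustrative column names
        writer.writerows(pairs)

write_manifest(pairs, "train_manifest.csv")
```

The hard part, of course, is producing the corrected target text in the first place, which is manual labour.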

3) Summarisation

The summarisation step (specifically abstractive summarisation, rather than extractive) used another Transformer-based model, DistilBART, a distilled variant of BART (again via HuggingFace).

Eat my tokens

What that means is that a checkpoint finetuned for the abstractive summarisation task was loaded into the BART model in the Transformers library (unfortunately there's no documentation for DistilBART itself). The developer Sam Shleifer works at FAIR, ex-HuggingFace.

I wrote a little routine, in a tap subpackage called 'precis', to summarise in chunks sized to the Transformer model's input limit.
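The 'precis' routine itself isn't reproduced here, but the gist of chunking to a model's input limit can be sketched like this, using a word count as a crude stand-in for the real token limit:

```python
def chunk_sentences(sentences, max_words: int = 200):
    """Greedily pack whole sentences into chunks under a word budget
    (a stand-in for the summarisation model's real token limit)."""
    chunks, current, count = [], [], 0
    for s in sentences:
        n = len(s.split())
        if current and count + n > max_words:
            chunks.append(" ".join(current))  # flush the full chunk
            current, count = [], 0
        current.append(s)
        count += n
    if current:
        chunks.append(" ".join(current))
    return chunks
```

Each chunk then goes through the summariser independently, and the partial summaries are concatenated.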

Lastly I passed the summaries into an exporter class I put together, capable of targeting multiple export file types: plain text, [plain] markdown, GitHub Markdown (featuring collapsible <details> blocks), and lever MMD. More details on that below.
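The exporter class isn't shown here either, but the dispatch over target formats might look like the following sketch (lever MMD omitted, since the format is custom):

```python
def export(summary: str, transcript: str, fmt: str = "text") -> str:
    """Render a summary (and its full transcript) for a given target format."""
    if fmt == "text":
        return f"{summary}\n\n{transcript}"
    if fmt == "markdown":
        return f"**Summary**\n\n{summary}\n\n{transcript}"
    if fmt == "github":
        # Collapsible block: summary visible, full transcript expandable
        return (
            "<details>\n"
            f"<summary>{summary}</summary>\n\n"
            f"{transcript}\n"
            "</details>"
        )
    raise ValueError(f"unknown format: {fmt}")
```

The GitHub flavour is what makes the "skim the summaries, expand on demand" reading mode possible.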

An example of the summarisation output from the 19th of October:

  • Government announces plans to try and end sale of new gas boilers in 14 years. Two women who lost their fathers to corona virus are taking the government to court. A former soldier who was standing trial in Belfast in relation to a fatal shooting during the Troubles has died.

  • Astronomers are being asked to help examine five years worth of data captured by several telescopes. Budding astronomers will be asked to look out for the light of stars being temporarily dimmed, a sign that a planet may have passed in front of them. MPs will debate plans to day for new rules that would mean M p suspended for more than two weeks could face a recall petition.

  • City traders are betting that the Bank of England will increase interest rates as soon as next month. Finely restarants up and down the country reporting in all time high in thefts of luxury soaps and handwash from bathrooms. The EN ACES is feeling the strain of running several cronavirus vaxon programs at the same time as the annual flee bab.

  • A group of empires and peers are calling on their own pension fund to divest from Chinese companies accused of complicity in human rights violations. James Landale, our diplomatic correspondent, joins us how much of the pension fund is invested in these companies. Snow's 16 minutes past six ol onors are to be given grants of 5,000 pounds to replace their gas boiler with an electric heat pump.

  • Gas boilers produce one-fifth of the UCA's total carbon emissions, and those need to be eliminated to halt climate change. E-Net Zero stratogy, tipped for publication later to day R, really looks to address this point M. There's suggestions that in wind power will be quadrupled initially and will increase tenfold by 20-50.

The first thing that strikes you on re-reading these is how much of a relief it is to review the news in aggregate (as in, relieved of the duty of going through it 'synchronously', both for the sometimes weighty subject matter and the length of time it'd take).

The items in the news vary from deaths and climate change to soap theft and recent debates in Parliament.

I find the quality of the summaries quite hard to gauge objectively, but it greatly improved after normalising the text with the grammar fixer.

This approach (collapsible text blocks with summaries expandable to full-length transcripts) also allows you to home in on the good/interesting/exciting/promising/distracting news.

The pipeline was quite slow to run, due to both the preprocessing and inference. One suggestion I received after publishing the source code was to try to operate on live streams, which would of course mean the pipeline could start sooner, and therefore finish earlier.

In fact, I strongly suspect there's a more reliable approach that could be taken in the cloud by opting for live chunked streaming rather than an offline, full multi-hour download. I'm still thinking about what cloud infra would be best for this long-term.

Event-driven Lambda functions triggered on a rolling 10-minute cron basis between the start and end of the programme? Or would a persistent EC2 instance be better suited?
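Either way, the trigger schedule itself is trivial to enumerate; for example, a hypothetical three-hour programme on a 10-minute step:

```python
from datetime import datetime, timedelta

def trigger_times(start: datetime, end: datetime, step_minutes: int = 10):
    """Enumerate the rolling trigger times between programme start and end."""
    t = start
    times = []
    while t < end:
        times.append(t)
        t += timedelta(minutes=step_minutes)
    return times

# e.g. a 06:00-09:00 programme gives 18 ten-minute windows
ts = trigger_times(datetime(2021, 10, 15, 6), datetime(2021, 10, 15, 9))
```

The open question is less the schedule than what compute each tick should wake up.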

For now my suspicion is that since the audio conversion and transcription are completely separate steps, whose timings will inevitably drift out of step, they should run as parallel Lambda jobs.

In the current form, the act of processing the final segment of the news programme should, logically, then trigger the 'piecing it all together' part of the pipeline (summarisation and site publication).

However, this'd mean you wouldn't be able to check in on the programme's progress until it'd gone off air (and live monitoring is indeed one of the principal motivations for shifting to an 'online' mode of processing in the first place).

So in fact the routine would need to be rewritten in a pretty big way, batching as much of the summarisation as possible and storing leftovers between the 10-minute transcripts in some sort of persistent storage (low-latency EFS ultimately pushing to S3, or more simply just a good old-fashioned git repo, since it's only text).
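The carry-over logic between 10-minute transcripts could be as simple as holding back any trailing partial sentence until the next chunk arrives (a sketch, splitting naively on full stops):

```python
def batch_with_carry(transcript: str, carry: str = ""):
    """Split a transcript chunk into complete sentences to summarise now,
    carrying any trailing partial sentence over to the next chunk."""
    text = (carry + " " + transcript).strip() if carry else transcript
    if text.endswith("."):
        return text, ""                  # chunk ends on a sentence boundary
    cut = text.rfind(". ")
    if cut == -1:
        return "", text                  # no complete sentence yet; carry it all
    return text[: cut + 1], text[cut + 2 :]
```

Whatever holds the carry between invocations (EFS, S3, a git repo) only ever needs to store that small remainder, not the whole transcript.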

I also suspect there could be an interesting pipeline if written news were downloaded in parallel and matched against the stories, potentially to improve transcription quality. However, this would require a good deal of 'plumbing' I haven't set up yet.


The other aspect of this project was wire, which displayed the transcripts in lever MMD format.

Lever is a file format I came up with at university to assist with writing notes at obscene speed, when you didn't want to think about indentation.

It has a few other tricks, but was primarily read-only before this project, for which I put together the necessary tools to convert it to HTML and display it on a website, wire, on poll.spin.systems.

For this project I used CI/CD on GitLab, with a completely programmatically written source (something I'd never done before with a static site). The source here is private (as are all the source repos for this website) but the tree is simply 3 top-level nodes:

The YAML file does the standard procedure for a static site of deploying artifacts on a public path, but with a twist: when finished it sends a POST request to the GitLab API to the pipeline triggering endpoint for quill (also mirrored on GitHub), which then triggers the workflow defined in the quill repo's CI YAML that:
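The interesting part is that final POST: GitLab's pipeline trigger endpoint takes a trigger token and a ref. A sketch of such a job (the project ID, token variable and branch name are placeholders, not the real ones):

```yaml
# Sketch of a final job in a .gitlab-ci.yml that triggers a downstream repo's pipeline
trigger_quill:
  stage: deploy
  script:
    - >
      curl --request POST
      --form "token=$QUILL_TRIGGER_TOKEN"
      --form "ref=master"
      "https://gitlab.com/api/v4/projects/<quill-project-id>/trigger/pipeline"
```

The trigger token would be stored as a masked CI/CD variable in the upstream project.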

This was a little tricky to wrap my head around at first, but it works perfectly. The only thing to remember is to push the driver package, ql, to PyPI when it changes.

I saw some derisive comments not long after writing this about how programmers will 'write an entire custom site builder', but I don't personally like that kind of de-skilling, 'don't re-invent the wheel' rhetoric, and would do it again (perhaps with Jinja templating, which I've since discovered makes it much faster to put together).


In September 2022, OpenAI released Whisper. Whisper combines VAD, ASR and punctuation/casing into a single model, making much of the approach above obsolete.

This is of course great news!

Some observations from the new model: