wire/tap

An audio transcriber for web radio (pulling from BBC Sounds)

tap is a software pipeline I pieced together in early 2021, at a time when it had become both difficult and taxing to follow along with the news cycle. The pipeline took recordings of BBC radio broadcasts via beeb (among other sources) and proceeded in 3 core steps before exporting to a website (which I could check for a sign that the pipeline had finished):

1) VAD (Voice Activity Detection)

i.e. identifying where someone is speaking based on the 'prosodic feature' of intensity

This is important here as Transformer-based models can't process long inputs
The tool inaSpeechSegmenter seemed well regarded in the community, and had been used in some large-scale archival projects
I ended up having to supplement this manually, finding what I deemed to be low activity regions based on analysing the waveform data myself [with a simple moving average]
This was due to having segments that were still too long. The maximum token size of Transformer models is genuinely quite impractical, and led to some spurious artefacts downstream from where these breaks had to be inserted regardless of the sentence structure (despite this 'naive' effort to alleviate the problem)
I'd like to try Hervé Bredin's PyAnnote (introduced in April 2021), which as I understand it is a neural VAD and may perform much better/faster, but I'd be lying if I said I wasn't a little intimidated by the amount of preparatory data handling needed to satisfy a specialist library like that (the tutorials show that various YAML configs need to be created, and environment variables pointing to them then passed around). Since I don't see the VAD as the bottleneck to performance for now, it's not my top priority.
Project GitHub
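As a rough illustration of the manual supplement described above (finding a low-activity region with a simple moving average), here is a minimal sketch. It is not the code from tap: the function name, the window size and the use of absolute amplitude as a stand-in for intensity are all assumptions made for the example.

```python
from typing import List


def quietest_split_point(samples: List[float], window: int) -> int:
    """Return the index at the centre of the quietest window of samples.

    Hypothetical helper: intensity is approximated by absolute amplitude,
    smoothed with a simple moving average over `window` samples.
    """
    if window < 1 or len(samples) < window:
        raise ValueError("window must be between 1 and len(samples)")
    intensity = [abs(sample) for sample in samples]
    # One average per full window, indexed by the window's start position
    averages = [
        sum(intensity[start:start + window]) / window
        for start in range(len(intensity) - window + 1)
    ]
    quietest_start = min(range(len(averages)), key=averages.__getitem__)
    return quietest_start + window // 2


# Toy waveform: speech, a short silence, then speech again
waveform = [0.9, -0.8, 0.9, -0.7, 0.0, 0.0, 0.0, 0.8, -0.9, 0.7]
split_at = quietest_split_point(waveform, window=3)
```

On real audio the window would span a fraction of a second of samples, and an over-long segment would be cut at the returned index.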

2) Transcription

For transcription I at first hoped to use a small model, but the performance only approached a reasonable standard (i.e. became legible, with anywhere near useful accuracy) with the largest model. At the time this was wav2vec2-large-960h-lv60-self

The code here is in fact the simplest step.
I later took a course at GTC 2021 which covered the underlying fairseq toolkit, which I'd like to take a deeper look at
(I simply did not need to go into this level of detail to achieve my goal here, which was to process COVID press conference audio recordings and news programmes)
Since then, in July 2021 according to the fairseq README, FAIR released a "robust" Wav2Vec2 model, which I have subsequently updated to (but I find it merely makes different mistakes, roughly equivalent to those it fixes).
Hsu et al. (April 2021) "Robust wav2vec 2.0"
Additionally, there is a strong 'Western' bias visible in the model's choices of transcription, e.g. "Carolina" is given repeatedly when "Kerala" or "Korea" are mentioned in speech. The solution to this dataset bias is most likely going to come from future generations of this technology. My personal expectation is that models should incorporate a degree of logical verification to identify mistakes (which such language models don't currently do).
As a final step, I pass the transcription through a T5 grammar fixer model, which does quite well at normalising the punctuation and capitalisation from the Wav2Vec2 output (which is all-caps and entirely without punctuation other than full stops inserted at the VAD-identified pauses).
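The shape of the text handed to the grammar fixer (all-caps, with full stops only at the VAD-identified pauses) can be sketched as below. This is an assumption-laden illustration, not the tap code: join_segments is a hypothetical name, and it assumes one transcript string per VAD segment.

```python
from typing import List


def join_segments(segment_texts: List[str]) -> str:
    """Join per-segment transcripts, ending each one with a full stop.

    Hypothetical helper: each VAD segment boundary is treated as a pause,
    so it becomes the only punctuation in the raw transcript.
    """
    cleaned = [text.strip() for text in segment_texts if text.strip()]
    return " ".join(f"{text}." for text in cleaned)


raw_transcript = join_segments(["GOOD MORNING", "   ", "THIS IS TODAY"])
```

Empty segments (silence that slipped past the VAD) are dropped so they don't produce stray full stops.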

A sample from October 15th, with major mistranscriptions in bold (emphasis added):

Good Morning It 6 O'Clock on Friday, the fifteenth of October. This is to day with Martha Carney and Nick Robinson, the headlines. This morning overseas H. G. V drivers are to be allowed to make more deliveries in the U. Ca. ["UK"] to tackle problems in the supply chain. We'll ask the Transport Secretary why the governments changed its mind about pulling the Immigration Leader to en staff shortages. Some people which've had cover ["COVID"] test de site in Barkshire have been advised to get retested after it emerged. Some negative results were wrong, could other parts of the country also be effected. Also to day, the legal time limit that means thousands of dollars.. We've had to keep all the window shut because it's horrible, really horrible. All the streets, OIS covering bents, "I've got four or five been backs, satin at home, watin to be taken, but nothing to do with hem you know, and can't." An Na Soaa s atadel ["And Adele"] releases her first track in six years will the critics go easy on her. The babysee ["BBC"] news is read by Allen Smith: "The rules on the number of deliveries which overseas H. G. V drivers can make in the W. K ["UK"] are to be relaxed to try to ease the pressure on supply chains." The government hopes the move will help to prevent shortages in the run-up to Christmas.

As can be seen above, many of the major mistakes being made come from proper nouns which could presumably be fine-tuned (examples here and here). To do so would require building a custom dataset from pairs of target text (sentence) and audio files (path).
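A minimal sketch of what such a dataset manifest might look like, assuming a CSV with the two columns named above (path and sentence). The clip filename and sentence are made up for the example, and the exact format expected by any particular fine-tuning script is not something this sketch claims to match.

```python
import csv
import io
from typing import Iterable, TextIO, Tuple


def write_manifest(pairs: Iterable[Tuple[str, str]], handle: TextIO) -> None:
    """Write (audio path, target sentence) pairs as CSV rows with a header."""
    writer = csv.DictWriter(handle, fieldnames=["path", "sentence"])
    writer.writeheader()
    for path, sentence in pairs:
        writer.writerow({"path": path, "sentence": sentence})


# Hypothetical example pair: a corrected sentence and the clip it came from
buffer = io.StringIO()
write_manifest(
    [("clips/0001.wav", "The rules on deliveries in the UK are to be relaxed")],
    buffer,
)
manifest_text = buffer.getvalue()
```

In practice the handle would be a file opened with newline="" rather than an in-memory buffer.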

Again the procedure to do this looks quite involved. Thankfully there seems to be a less fiddly way to fine-tune Wav2Vec2 using Lightning Flash (a task-oriented PyTorch framework built on PyTorch Lightning, docs).

3) Summarisation

The summarisation step (specifically abstractive rather than extractive summarisation) used another Transformer-based model, DistilBART, a variant of BART (again via HuggingFace).

Eat my tokens

What that means is that a checkpoint finetuned for the abstractive summarisation task was loaded into the BART model in the Transformers library (unfortunately there's no documentation for DistilBART itself). The developer Sam Shleifer works at FAIR, ex-HuggingFace.

I wrote a little routine to summarise in chunks appropriate to the Transformer model in a subpackage in tap called 'precis'.
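The general idea of chunked summarisation can be sketched as follows. This is not the 'precis' code: the function names are hypothetical, whitespace-separated words stand in for the model's real tokens, and the summariser is passed in as a plain callable so the example runs without loading a model.

```python
from typing import Callable, List


def chunk_sentences(sentences: List[str], max_tokens: int) -> List[str]:
    """Group whole sentences into chunks of at most `max_tokens` words.

    A single sentence longer than the budget still gets its own chunk.
    """
    chunks: List[str] = []
    current: List[str] = []
    current_size = 0
    for sentence in sentences:
        size = len(sentence.split())
        if current and current_size + size > max_tokens:
            chunks.append(" ".join(current))
            current, current_size = [], 0
        current.append(sentence)
        current_size += size
    if current:
        chunks.append(" ".join(current))
    return chunks


def summarise_in_chunks(
    sentences: List[str], max_tokens: int, summarise: Callable[[str], str]
) -> List[str]:
    """Summarise each chunk independently and return one summary per chunk."""
    return [summarise(chunk) for chunk in chunk_sentences(sentences, max_tokens)]


example_chunks = chunk_sentences(["a b c.", "d e.", "f g h i."], max_tokens=5)
```

Splitting on sentence boundaries rather than at a fixed token offset avoids cutting a story mid-sentence, at the cost of slightly uneven chunk sizes.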

Lastly I passed the summaries into an exporter class I put together capable of targeting multiple different export file types: plain text, [plain] markdown, GitHub Markdown (featuring collapsible <details> blocks), and lever MMD. More details on that below.
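For the GitHub Markdown target, a collapsible block with the summary as the visible line and the transcript folded away could be produced along these lines. The function name is hypothetical and this is only a sketch of the output shape, not the exporter class itself.

```python
import html


def to_details_block(summary: str, transcript: str) -> str:
    """Render a summary and its transcript as a collapsible <details> block.

    The summary is HTML-escaped because it sits inside a tag; the transcript
    is left as-is, since text after a blank line is rendered as Markdown.
    """
    return (
        "<details>\n"
        f"<summary>{html.escape(summary)}</summary>\n"
        "\n"
        f"{transcript}\n"
        "\n"
        "</details>\n"
    )


block = to_details_block("Boilers & heat pumps", "Full transcript text here.")
```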

An example of the summarisation output from the 19th of October:

Government announces plans to try and end sale of new gas boilers in 14 years. Two women who lost their fathers to corona virus are taking the government to court. A former soldier who was standing trial in Belfast in relation to a fatal shooting during the Troubles has died.

Astronomers are being asked to help examine five years worth of data captured by several telescopes. Budding astronomers will be asked to look out for the light of stars being temporarily dimmed, a sign that a planet may have passed in front of them. MPs will debate plans to day for new rules that would mean M p suspended for more than two weeks could face a recall petition.

City traders are betting that the Bank of England will increase interest rates as soon as next month. Finely restarants up and down the country reporting in all time high in thefts of luxury soaps and handwash from bathrooms. The EN ACES is feeling the strain of running several cronavirus vaxon programs at the same time as the annual flee bab.

A group of empires and peers are calling on their own pension fund to divest from Chinese companies accused of complicity in human rights violations. James Landale, our diplomatic correspondent, joins us how much of the pension fund is invested in these companies. Snow's 16 minutes past six ol onors are to be given grants of 5,000 pounds to replace their gas boiler with an electric heat pump.

Gas boilers produce one-fifth of the UCA's total carbon emissions, and those need to be eliminated to halt climate change. E-Net Zero stratogy, tipped for publication later to day R, really looks to address this point M. There's suggestions that in wind power will be quadrupled initially and will increase tenfold by 20-50.

The first thing that hits you on re-reading these is how relieving it is to be able to review these in the aggregate (as in, relieving you from the duty of going through them 'synchronously', both for the sometimes weighty subject matter and the length of time it'd take).

The items in the news vary from deaths and climate change to soap theft and recent debates in Parliament.

I find the quality of the summaries quite hard to gauge objectively, but it greatly improved after normalising the text with the grammar fixer.

This approach (collapsible text blocks with summaries expandable to full length transcripts) also allows you to home in on the good/interesting/exciting/promising/distracting news.

The pipeline was quite slow to run, due to both the preprocessing and inference. One suggestion I received after publishing the source code was to try to operate on live streams, which would of course mean the pipeline could start sooner, and therefore finish earlier.

In fact, I strongly suspect there's a more reliable approach that could be taken in the cloud if opting for live chunked streaming rather than offline / full multi-hour download. I'm still thinking about what cloud infra would be best for this long-term

Event-driven Lambda functions triggered on a rolling 10-minute cron basis between the start and end of the programme? Or would a persistent EC2 instance be better suited?
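Whichever infrastructure is chosen, the rolling 10-minute schedule amounts to slicing the programme into windows. A minimal sketch, with a hypothetical function name and an example air time chosen purely for illustration:

```python
from datetime import datetime, timedelta
from typing import List, Tuple


def rolling_windows(
    start: datetime, end: datetime, minutes: int = 10
) -> List[Tuple[datetime, datetime]]:
    """Split [start, end) into consecutive windows of `minutes` length.

    The final window is cut short if the programme does not end on a boundary.
    """
    if minutes < 1:
        raise ValueError("minutes must be at least 1")
    step = timedelta(minutes=minutes)
    windows = []
    window_start = start
    while window_start < end:
        window_end = min(window_start + step, end)
        windows.append((window_start, window_end))
        window_start = window_end
    return windows


# Example: a 25-minute slot starting at 06:00
windows = rolling_windows(datetime(2021, 10, 19, 6, 0), datetime(2021, 10, 19, 6, 25))
```

Each window would correspond to one triggered job; the last one is the natural place to kick off the 'piecing it all together' step.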

For now my suspicion is that since the audio conversion and transcription are going to be completely separate steps, and will therefore run out of step with each other, they should be on parallel Lambda jobs.

In the current form, the act of processing the final segment of the news programme should, logically, then trigger the 'piecing it all together' part of the pipeline (summarisation and site publication).

However, this'd mean you wouldn't be able to check in on the progress to monitor the programme until it'd gone off air (which is indeed one of the principal motivators to shift to an 'online' mode of processing in the first place).

So in fact the routine would need to be rewritten in a pretty big way, batching as much of the summarisation as possible and storing leftovers between the 10 minute transcripts in some sort of persisted storage (low latency EFS ultimately pushing to S3, or more simply just using a good old fashioned git repo since it's only text).

This 'leftover' isn't necessary to start transcribing the next chunk, but is necessary to run summarisation. A bit of an awkward problem that can't be solved without careful thought.
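One way to picture the 'leftover' is as the unfinished sentence at the end of a 10-minute transcript, carried over and prepended to the next one. The sketch below assumes that interpretation; the function name is hypothetical and the sentence splitting is deliberately naive (it would trip over abbreviations written with full stops).

```python
from typing import List, Tuple


def split_complete_sentences(leftover: str, new_text: str) -> Tuple[List[str], str]:
    """Return (complete sentences, new leftover) for the next chunk of text.

    The previous leftover is prepended, so a sentence cut in two by a chunk
    boundary is reassembled before it goes on to summarisation.
    """
    combined = f"{leftover} {new_text}".strip()
    sentences: List[str] = []
    start = 0
    for index, char in enumerate(combined):
        if char in ".?!":
            sentence = combined[start:index + 1].strip()
            if sentence:
                sentences.append(sentence)
            start = index + 1
    return sentences, combined[start:].strip()


first_batch, carry = split_complete_sentences(
    "", "Gas boilers are to be phased out. The net zero"
)
second_batch, carry_after = split_complete_sentences(
    carry, "strategy is due today. More"
)
```

Only the carried-over string needs persisting between runs, which is why something as simple as a text file in a git repo could be enough.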

I also suspect that there could be an interesting potential pipeline if news was downloaded in parallel, and then matched against the stories, potentially to improve the transcription quality. However this would require a good deal of 'plumbing' I haven't set up yet.

wire

The other aspect of this project was wire, which displayed the transcripts in lever MMD format.

Lever is a file format I came up with at university to assist with writing notes at obscene speed, when you didn't want to think about indentation.

It has a few other tricks, but was primarily read-only before this project, for which I put together the necessary tools to convert it to HTML and display it on a website, wire, on poll.spin.systems.

For this project I used CI/CD on GitLab, with a completely programmatically written source (something I'd never done before with a static site). The source here is private (as are all the source repos for this website) but the tree is simply 3 top level nodes:

site/
transmission/
.gitlab-ci.yml

The YAML file does the standard procedure for a static site of deploying artifacts on a public path, but with a twist: when finished it sends a POST request to the GitLab API to the pipeline triggering endpoint for quill (also mirrored on GitHub), which then triggers the workflow defined in the quill repo's CI YAML that:

installs the quill package from PyPI
"sources the manifest" (which means: pull down the repos for each subdomain)
"stand up fold.wire" (which means: build the HTML pages for the transcripts in wire, from the tap pipeline above)
"remote push manifest" (which means: push the changes, if any, back to any/all git repos which changed during the build step)

This was a little tricky to wrap my head around at first, but it works perfectly. The only thing to remember is to push the driver package, ql, to PyPI when it changes.
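The hand-off from the site's pipeline to the quill pipeline is a single POST to GitLab's pipeline trigger endpoint. In a CI file this would more typically be a curl call; the Python sketch below just shows the shape of the request. The project ID, token and branch name are placeholders, and the request is built but not sent.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_trigger_request(
    project_id: int, trigger_token: str, ref: str, host: str = "https://gitlab.com"
) -> Request:
    """Build (without sending) a POST to GitLab's pipeline trigger endpoint."""
    url = f"{host}/api/v4/projects/{project_id}/trigger/pipeline"
    # The trigger token and target branch/tag are sent as form fields
    body = urlencode({"token": trigger_token, "ref": ref}).encode("ascii")
    return Request(url, data=body, method="POST")


# Placeholder values: a real job would read the token from a CI secret
request = build_trigger_request(12345, "example-token", ref="main")
# urllib.request.urlopen(request) would actually fire the downstream pipeline
```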

I saw some derisive comments not long after writing this, about how programmers will 'write an entire custom site builder', but I don't personally like that kind of de-skilling 'don't re-invent the wheel' rhetoric, and would do it again (perhaps with Jinja templating, which I've since discovered makes it much faster to put together).

GitLab CI/CD has some great advantages (a lot of the comparisons online are outdated). I personally found the GitLab interface less mollycoddling and clearer about secrets etc. The YAML formats for the two are slightly different, but neither is hard, and repos can be synced/mirrored for the best of both worlds.
Sphinx is a widely used tool for generating site documentation but if your output isn't code (in this case it was radio transcripts) then I don't feel it really suits you.
This project also features partial builds, but the number of files doesn't really warrant much optimisation. By far the slowest step involved in iterating is the tap pipeline.

v2

In September 2022, OpenAI released Whisper. Whisper combines VAD, ASR and punctuation/casing into a single model, making much of the approach above obsolete.

This is of course great news!

Some observations from the new model:

No segmentation is needed, other than for debugging and evaluation. If segmentation is done, the model seems to even notice where a piece of speech was broken off mid-stream, and will represent it with "...".
The acronyms like "NHS" which Wav2Vec 2.0 read as "ennay chess" and names like "Martha Carney" (which Wav2Vec 2.0 completely flubbed) are consistently picked up by Whisper.
The small Whisper model (the default) is ridiculously fast and far higher quality than the best Wav2Vec 2.0 model.
The large Whisper model is still faster than the best (largest) Wav2Vec 2.0 model, and using it brings a benefit in transcription accuracy (lower error rate).
Using a smaller segment produces a transcript with more precise timing (presumably the VAD produces smaller segments, which are then each given timings). In other words, the transcript timings (similar to subtitle files) have "shorter lines".
There are still minor mistakes, for instance a news item about Britney Spears's court case introduced "The documentary fueling fans [sic] calls to free Britney Spears...", missing the apostrophe. It's possible that another pass with a language model could pick up this mistake.
GPU usage does not spike during inference: full use is made of the GPU, while memory usage sits at around 50%.
The large model processes around 4-5x faster than real time [on GPU], meaning that online 'streaming' processing is feasible. Something I'm considering is that perhaps you'd even have to introduce a delay, waiting for a pause to segment on.
At this speed, I wonder if you could throttle it to 50% GPU utilisation, to make a machine running online transcription more usable (if 100% is unnecessary to keep up with a real time stream).
Alternatively, you could imagine the entire pipeline being in real time (i.e. running summarisation in stops and starts after rounds of transcription).
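The 4-5x figure gives a rough feel for how much headroom an online setup would have. The arithmetic below is a back-of-the-envelope sketch with hypothetical function names, and it assumes throughput on short live chunks matches what was seen on full recordings, which may not hold.

```python
def processing_seconds(audio_seconds: float, speedup: float) -> float:
    """Time needed to transcribe `audio_seconds` of audio at a given speed-up."""
    if speedup <= 0:
        raise ValueError("speedup must be positive")
    return audio_seconds / speedup


def gpu_busy_fraction(speedup: float) -> float:
    """Fraction of wall-clock time spent transcribing a real-time stream."""
    if speedup <= 0:
        raise ValueError("speedup must be positive")
    return 1.0 / speedup


# A 10-minute chunk at 4x faster than real time
chunk_cost = processing_seconds(600, 4)
busy_at_4x = gpu_busy_fraction(4)
busy_at_5x = gpu_busy_fraction(5)
```

Under that assumption the GPU would sit idle for roughly three quarters of the time on a live stream, which is the slack that either throttling or interleaved summarisation could use.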