Comment by d_burfoot
You should train a GPT on the raw data, and then figure out how to reuse the DNN for various other tasks you're interested in (e.g. one-shot learning, fine-tuning, etc). This data setting is exactly the situation that people faced in the NLP world before GPT. I would guess that some people from the frontier labs would be willing to help you, I doubt even your large dataset would cost very much for their massive GPU fleets to handle.
Hi d_burfoot, really appreciate you bringing that up! The idea of pre-training a big foundation model on our raw data with self-supervised learning (SSL), kind of like how GPT emerged in NLP, is definitely something we've considered, and we have experimented with transformer architectures along those lines.
The main hurdle we've hit is honestly the scale of relevant data needed to train such large models from scratch effectively. Our ~19.5-year dataset is massive by ecoacoustics standards, but a significant portion of it is silence or ambient noise. The actual volume of distinct events or complex acoustic scenes is therefore much lower than in the densely packed corpora typically used to train foundational speech or general audio models, so our effective dataset is much smaller than the raw duration suggests.
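To make the "effective size" point concrete, here's a minimal sketch of the kind of activity estimate we're talking about: a short-term RMS energy threshold over each file to see what fraction of the audio contains anything above the noise floor. The file path, frame length, and threshold are illustrative assumptions, not our actual pipeline.

```python
# Rough sketch: estimate what fraction of a recording contains acoustic
# activity vs. near-silence, using a simple short-term RMS energy threshold.
import numpy as np
import soundfile as sf

def active_fraction(path, frame_s=1.0, threshold_db=-50.0):
    audio, sr = sf.read(path)
    if audio.ndim > 1:                       # mix multi-channel files down to mono
        audio = audio.mean(axis=1)
    frame_len = int(frame_s * sr)
    n_frames = len(audio) // frame_len
    frames = audio[: n_frames * frame_len].reshape(n_frames, frame_len)
    rms = np.sqrt((frames ** 2).mean(axis=1) + 1e-12)
    rms_db = 20 * np.log10(rms + 1e-12)
    return float((rms_db > threshold_db).mean())

# e.g. active_fraction("site01_2021-06-14.flac") returning 0.12 would mean
# only ~12% of that file crosses the energy threshold (file name hypothetical).
```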
We also tried leveraging existing pre-trained SSL models (like Wav2Vec 2.0 and HuBERT, both trained on speech), but the domain gap is substantial. As you can imagine, raw ecoacoustic field recordings are characterized by significant non-stationary noise, overlapping sound sources, sparse events of interest buried in long stretches of quiet, huge acoustic diversity, and variation across microphones and weather conditions.
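For context, the probing we did looks roughly like the sketch below: pull frame-level embeddings from a speech-pretrained Wav2Vec 2.0 checkpoint and hand them to a small downstream classifier. The clip name and the pooling/probing step are placeholders, not a definitive recipe.

```python
# Sketch: embed a field-recording clip with a speech-pretrained Wav2Vec 2.0
# bundle from torchaudio, then mean-pool to a clip-level vector.
import torch
import torchaudio

bundle = torchaudio.pipelines.WAV2VEC2_BASE            # speech-pretrained weights
model = bundle.get_model().eval()

waveform, sr = torchaudio.load("field_recording_clip.wav")   # (channels, time)
waveform = waveform.mean(dim=0, keepdim=True)                 # mono, (1, time)
if sr != bundle.sample_rate:
    waveform = torchaudio.functional.resample(waveform, sr, bundle.sample_rate)

with torch.inference_mode():
    features, _ = model.extract_features(waveform)      # list of per-layer tensors

clip_embedding = features[-1].mean(dim=1)               # (1, hidden) by mean pooling
```

In practice you would train a linear probe on embeddings like this and compare it against simple spectrogram baselines; for our recordings the speech prior carried over far less than it does for speech-adjacent tasks.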
This messes with the SSL pre-training tasks themselves. Predicting masked audio doesn't work as well when the surrounding context is just noise, and the data augmentations used in contrastive learning can sometimes accidentally remove the unique signatures of the animal calls we're trying to learn.
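Here's a toy illustration of that augmentation failure mode, using a synthetic spectrogram and a SpecAugment-style time mask (shapes and mask sizes are made up for the example). When vocalizations are sparse, a random mask can wipe out the only event in a crop, so the augmented view no longer shares any signal with the original.

```python
# Toy example: a mostly-silent spectrogram with one short call, hit by a
# random time mask of the kind used in SSL / contrastive augmentation.
import torch
import torchaudio

spec = torch.zeros(1, 128, 400)            # (channel, freq, time), mostly silence
spec[:, 40:60, 180:200] = 1.0              # one short call, ~20 frames wide

masker = torchaudio.transforms.TimeMasking(time_mask_param=120)
view = masker(spec.clone())                # random contiguous time mask set to 0

kept = float((view[:, 40:60, 180:200] > 0).float().mean())
print("fraction of call frames surviving the mask:", kept)
# When the random mask happens to land on frames 180-200, this prints 0.0:
# the augmented view contains no trace of the call the model was meant to learn.
```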
It's definitely an ongoing challenge in the field! People are trying different things, like initializing audio transformers with weights pre-trained on image models (ViT adapted for spectrograms) to give them a head start. How best to get large models working in these specialized, data-constrained domains is still an open question. Thanks again for the suggestion, it really hits on a core challenge!
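For anyone curious what that "warm start from vision" idea looks like (in the spirit of the Audio Spectrogram Transformer), here's a hedged sketch using timm. The model name, class count, and 224x224 input size are illustrative; a real setup would also rework the positional embeddings for non-square spectrograms.

```python
# Sketch: ImageNet-pretrained ViT as initialization for a spectrogram classifier.
# timm folds the RGB patch-embedding weights down to a single input channel.
import timm
import torch

model = timm.create_model(
    "vit_base_patch16_224",
    pretrained=True,      # ImageNet weights as the starting point
    in_chans=1,           # single-channel log-mel spectrogram input
    num_classes=50,       # placeholder number of target sound classes
)

dummy_spec = torch.randn(2, 1, 224, 224)   # batch of spectrogram patches resized to 224x224
logits = model(dummy_spec)                 # (2, 50)
print(logits.shape)
```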