Neural Network/deep learning for crowd noise removal


EXTRASUPER81


This just occurred to me as a thing that could be done, so I thought I'd post on here to see if anyone knows anything about the subject, as basically all I know is what I've read in Wired articles. So possibly even less than nothing.

 

So any ideas? It would be really nice to clean all chatter etc off of bootlegs...autechre 2008, 2010, 2016 I am looking at you.


I have worked with neural networks and machine learning stuff a bit, although I feel I probably know fuck all about deep learning (convolutional neural networks). I think I can grasp the general concept though. 

 

Generally this stuff is done by training the neural network on samples of crowd noise and samples of music, so that it eventually learns to tell the difference between the two. However, it seems really difficult to get the training data: you probably don't have much noise recorded from the same event you're trying to clean up, and using sound from a different event may throw the network off (different crowd, room acoustics, sound system, etc.). Similarly, you would ideally need "pure" samples of the same audio that was played at the event (although maybe album versions would do in a pinch if the network is good enough).
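The "learn to tell the difference" step can be sketched with a toy frame classifier. Everything here is a made-up stand-in: white noise plays the "crowd", sine mixtures play the "music", and spectral flatness is the single hand-picked feature (a real system would learn features from real recordings):

```python
import numpy as np

rng = np.random.default_rng(0)
sr, frame = 8000, 512

def spectral_flatness(x):
    # Geometric mean / arithmetic mean of the power spectrum:
    # near 1 for noise-like frames, near 0 for tonal (music-like) frames.
    p = np.abs(np.fft.rfft(x)) ** 2 + 1e-12
    return np.exp(np.mean(np.log(p))) / np.mean(p)

# Stand-ins for training data: white noise ~ "crowd", sine mixtures ~ "music".
noise_frames = [rng.standard_normal(frame) for _ in range(50)]
t = np.arange(frame) / sr
music_frames = [np.sin(2 * np.pi * rng.uniform(100, 2000) * t)
                + 0.5 * np.sin(2 * np.pi * rng.uniform(100, 2000) * t)
                for _ in range(50)]

X = np.array([spectral_flatness(f) for f in noise_frames + music_frames])
y = np.array([1] * 50 + [0] * 50)  # 1 = noise, 0 = music

# "Training" here is just picking the threshold that best splits the feature.
threshold = (X[y == 1].min() + X[y == 0].max()) / 2
accuracy = ((X > threshold).astype(int) == y).mean()
```

It separates perfectly only because the stand-ins are absurdly easy to tell apart; real crowd noise vs. autechre is exactly the case where a single feature like this falls over.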

 

Assuming you get all the data you need, what you basically want is to feed in the training data - samples of crowd noise and "pure" audio (maybe run it through some reverb to approximate the venue sound) - build the model, then feed in the noisy audio. The model would then go through the audio chunk by chunk (realistically 1-2 seconds at a time) and add/subtract what it thinks is the "noise" part of the signal. On paper this looks simple enough, but there could be audible artifacts: for instance, when there's a quieter part and some asshole hollers over it in the recording, the network is going to have a hell of a time recreating the audio that should actually be there. Basically, what I want to say is that the model may well add unpredictable artifacts to the result no matter what, so even if you successfully remove the noise, there will still be some extra sound that was never actually played by the artist.
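The add/subtract idea described above is basically classic spectral subtraction (a pre-deep-learning technique, not a neural net). A minimal numpy sketch, with a sine standing in for the music and white noise for the crowd - both pure stand-ins:

```python
import numpy as np

rng = np.random.default_rng(1)
sr, frame = 8000, 512
t = np.arange(sr) / sr

clean = np.sin(2 * np.pi * 440 * t)    # stand-in for the music
noise = 0.3 * rng.standard_normal(sr)  # stand-in for crowd noise
noisy = clean + noise

def frames(x, n):
    return x[: len(x) // n * n].reshape(-1, n)

# Estimate an average noise magnitude spectrum from a noise-only sample
# (the hard-to-get "noise recorded from the same event" data).
noise_mag = np.abs(np.fft.rfft(frames(noise, frame), axis=1)).mean(axis=0)

# Per-frame spectral subtraction: subtract the noise estimate from each
# magnitude spectrum, floor at zero, keep the noisy phase, invert.
spec = np.fft.rfft(frames(noisy, frame), axis=1)
mag = np.maximum(np.abs(spec) - noise_mag, 0.0)
cleaned = np.fft.irfft(mag * np.exp(1j * np.angle(spec)), n=frame, axis=1).ravel()

ref = clean[: len(cleaned)]
err_before = np.mean((noisy[: len(cleaned)] - ref) ** 2)
err_after = np.mean((cleaned - ref) ** 2)
```

The flooring at zero is where the "extra sound that was never played" comes from in practice: it produces the warbly musical-noise artifacts this kind of processing is notorious for.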

 

All this is back-of-the-envelope theory though; it would be nice to have the chance to actually try it out. There's some amazing shit possible with deep learning, so I would not be surprised if someone puts together a noise-removal tool. I haven't really checked the literature, but I'm sure there's research going on.


I'm guessing a recurrent neural network would be the way to go, though I have no idea how much training data would be necessary, nor how one would go about picking hyperparameters.
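For a sense of what a recurrent setup would even look like: a minimal numpy forward pass that emits a per-frame "is this noise?" probability. The weights are random stand-ins for a trained model (so the outputs mean nothing here), and the hidden size is an arbitrary pick - this is just the shape of the computation, not a working denoiser:

```python
import numpy as np

rng = np.random.default_rng(2)
n_features, n_hidden = 64, 32  # e.g. 64 spectrogram bins per frame (arbitrary)

# Random weights stand in for a trained model; a real system would
# learn these from labelled crowd-noise / music frames.
Wx = rng.standard_normal((n_hidden, n_features)) * 0.1
Wh = rng.standard_normal((n_hidden, n_hidden)) * 0.1
Wo = rng.standard_normal(n_hidden) * 0.1

def noise_probs(spectrogram):
    """One sigmoid 'is this frame noise?' score per spectrogram frame."""
    h = np.zeros(n_hidden)
    out = []
    for x in spectrogram:
        h = np.tanh(Wx @ x + Wh @ h)  # recurrent state carries context
        out.append(1 / (1 + np.exp(-(Wo @ h))))
    return np.array(out)

fake_spectrogram = rng.standard_normal((100, n_features))
p = noise_probs(fake_spectrogram)
```

The recurrence is the point: whether a frame is crowd or music often depends on what came before it, which a frame-by-frame classifier can't see.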


My guess/intuition: don't bother.

 

Unlikely to outperform already existing methods. But who knows? Whatever technique you're going to use, it'll probably do two things:

1. Extract sounds from waveform

2. Interpolate across the changed waveform to restore the assumed ideal waveform without the extracted sounds.
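Step 2 is where the implicit assumptions bite. A toy sketch of the simplest possible version: cut a flagged segment out of a slowly varying signal and linearly interpolate across the gap (the signal and the flagged region are both made up):

```python
import numpy as np

sr = 8000
t = np.arange(sr) / sr
signal = np.sin(2 * np.pi * 5 * t)  # slow stand-in waveform

# Pretend step 1 flagged samples 3000-3100 as an extracted shout.
gap = np.zeros(len(signal), dtype=bool)
gap[3000:3100] = True

# Step 2: rebuild the gap from the surviving samples.
restored = signal.copy()
restored[gap] = np.interp(np.flatnonzero(gap),
                          np.flatnonzero(~gap), signal[~gap])

max_err = np.abs(restored - signal).max()
```

This only works because the stand-in signal barely changes across the gap. Real music changes far faster than any interpolation assumption, which is exactly the "implicit assumptions and estimations" problem - whatever fills the gap is a guess, deep net or not.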

 

Both steps use implicit assumptions and estimations, regardless of the technique you're going to use - deep neural nets or something else (think of the techniques Photoshop uses to edit images, which is a similar problem for sound). My guess is that a complex/expensive technique like deep neural nets will hardly, if at all, outperform a simple technique that's already available.

 

My reasoning is that you're unlikely to find a solution that can model "crowd noise" better than what's already out there. Deep neural nets tend to depend on lots of data and need problems with a static set of rules, like games or language. When it comes to crowd noise, and especially autechre gigs, I'm afraid you'll be limited by a lack of static rules defining crowd noise, or defining autechre as not-noise.

 

You could try though. But be prepared to spend a lot of time without a likely benefit, other than experience and lessons learned.

 

In the next tweet is a link to an interesting article on the possibilities/limitations of deep neural nets. It's non-technical/readable.

 

What you should take away from it, imo: you're not learning to label images or play a game. To an extent you're trying to label noise. But additionally, you're also predicting sounds, as you need to restore the audio after extracting the noise. Or, the other way around, you'll be predicting what the noise would sound like without the music - so you'll be predicting crowd noise. Which might be more problematic, as it doesn't look like that will follow strict rules. Less so than music, arguably. But who knows, right?


Hmmmm. Yes, it's a rabbit hole.

 

I was more hoping that someone had already done the heavy lifting on this one.

 

I've got a mate who's a lecturer in music tech at a University, I might see if he can get his students on it. That's what they're for right?

 

Then, the clean autechre recordings can be mine. ALL MINE. Mwhahahaha etc


iZotope's RX repair tools (particularly De-noise and Dialogue Isolate) are probably the closest we have at the moment:

 

 

 

Yeah, that's what I would recommend. I recently used it to (partially) remove crowd noise from an interview.


Doing a little reading around, it seems the majority of current research is geared toward isolating the human voice from background noise (usually other human voices). Some really impressive stuff, including some that can run faster than realtime on a Raspberry Pi.

 

However, I couldn't find anything specifically on separating musical information from background noise. I would think it's theoretically possible (especially for music like autechre's, which is more or less all synthetic), but not without great difficulty and with a low likelihood of the quality I'm hankering after - artifacts are likely to be almost as irritating as that guy chuntering away in the background. Also, I imagine preserving the sound of the space would be very difficult.

 

I guess in my simplistic pre-sleep mind I was hoping for a process like: input dataset A (autechre oeuvre) > input dataset B (crowd noise) > run iterative learning processes on datasets > remove B from A. Alas it turns out that cutting edge tech is more complicated than that! Who would have known?


 

"including some that can run faster than realtime"

it's really impressive that investigation of noise removal led to the discovery of time travel
Boom

 

The latest Raspberry Pi is really powerful.


Pretty impossible, but go for it. I think you're more likely to get a direct recording off an artist if they see you trying to train neural networks, right?

 

I am thinking that if you train a neural network on some artist's material, you can then generate brand new stuff that sounds like that artist.

Also - I think this has been discussed here already - you could in theory take recordings of some famous pianist or guitarist, annotate them with note data, train your network, and feed it your own piano roll to automagically have your solos played in their style. If it works for text, it might work for note data too. Might be worth trying out some day.
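The absolute simplest version of "learn a player's style from note data" is just a Markov chain over notes - the same idea as text generation, one level down from any neural net. The corpus here is a made-up handful of MIDI note numbers, nothing like a real annotated recording:

```python
import random

random.seed(0)

# Hypothetical stand-in for annotated note data from some pianist.
corpus = [60, 62, 64, 65, 67, 65, 64, 62, 60, 62, 64, 62, 60]

# "Training": count which note follows which (a bigram/Markov model,
# the crudest possible version of learning a style).
transitions = {}
for a, b in zip(corpus, corpus[1:]):
    transitions.setdefault(a, []).append(b)

def generate(start, length):
    """Sample a new melody in the 'style' of the corpus."""
    out = [start]
    for _ in range(length - 1):
        out.append(random.choice(transitions[out[-1]]))
    return out

melody = generate(60, 16)
```

Every generated transition is one the "pianist" actually played, which is both the appeal and the limitation: it can only ever recombine what it has seen, never invent.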

