
Kaggling log pt. 1: getting oriented in a new competition
Hello, it's been a while! Madeline's been hard at work at her new job and is becoming an expert in the front-end world, but she misses her excursions into data science and machine learning. So she logged onto Kaggle, found a cool competition about brains, and decided to join it.
Hosted by Harvard Medical School, the competition is meant to advance research on harmful brain activity. The goal is to classify seizures and other patterns of harmful brain activity in critically ill patients by feeding EEGs and spectrograms of their brain activity to a machine learning model. The data provided consists of sample brain activity recordings, plus a table of metadata on each sample.
The thing about Kaggle competitions is that they can be very overwhelming: you're bombarded with a dataset you've never seen before and have to make sense of, as well as oodles of notebooks of code. To try to orient myself (yes I'm switching to first person now, lol), I thought about what my strengths are and how I could use them to get my bearings. I enjoy writing, and it strengthens my memory, so I decided to write down my process in the form of blog posts and make copious notes on the code I would be writing. Furthermore, I made a promise to myself to understand any code I copied, before copying it!
I wrote an initial checklist of tasks that I'd need to go through to make a submission. I based it on a sample notebook that walks through the process of creating a submission for this competition. Here's the high-level list I came up with (not exhaustive):
- write down (understand!) the details of the competition
- load data (harder than it looks)
- figure out how to load pretrained models etc. without internet (the competition doesn't allow internet use)
- create spectrograms out of EEGs? (what does this entail?)
- train off of spectrograms rather than EEGs (which are a video format??)
  - I had a little confusion over what exactly an EEG format is, so I needed to do further research here:
    - I found out that a spectrogram visually represents the strength of a brain signal at different frequencies over time
    - while EEG commonly refers to the whole procedure of attaching electrodes to the scalp to detect electrical activity in the brain, and then to the recording of that result in a graph
  - [Image: an example of what an EEG looks like (top) and what a spectrogram looks like (bottom), from researchgate.net]
- not sure how to do EDA on this data format
  - perhaps check distributions of votes?
- get a working model, however bad, asap, to work with and improve
- don't forget you need an optimizer
  - use Adam because why not?
- define a learning rate schedule
  - may need further research because I don't remember how to do this
- add a model checkpoint that only saves the best model so far
- train!
- predict on test set (load best model weights)
- create submission.csv
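Since I flagged the learning rate schedule and the model checkpoint as things I'd need to look up, here's a rough, framework-free sketch of both ideas. It is not what the sample notebook does: the warmup-plus-cosine shape, the names `lr_at_step` and `BestCheckpoint`, and all the default values are my own assumptions for illustration. In a real training loop the returned rate would be handed to the optimizer (Adam, per the checklist), and "saving" would mean writing the model weights to disk.

```python
import math


def lr_at_step(step, total_steps, base_lr=1e-3, warmup_steps=100, min_lr=1e-6):
    """Hypothetical schedule: linear warmup, then cosine decay down to min_lr."""
    if step < warmup_steps:
        # Ramp up linearly so the first updates are small.
        return base_lr * (step + 1) / warmup_steps
    # Fraction of the post-warmup phase completed, clipped to [0, 1].
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(1.0, progress)
    return min_lr + 0.5 * (base_lr - min_lr) * (1 + math.cos(math.pi * progress))


class BestCheckpoint:
    """Track the epoch with the lowest validation loss seen so far."""

    def __init__(self):
        self.best_loss = float("inf")
        self.best_epoch = None

    def update(self, epoch, val_loss):
        """Return True when this epoch is the new best (i.e. time to save)."""
        if val_loss < self.best_loss:
            self.best_loss = val_loss
            self.best_epoch = epoch
            return True
        return False


if __name__ == "__main__":
    # Made-up validation losses, purely to show the control flow.
    fake_val_losses = [0.9, 0.7, 0.8, 0.6]
    checkpoint = BestCheckpoint()
    for epoch, loss in enumerate(fake_val_losses):
        if checkpoint.update(epoch, loss):
            print(f"epoch {epoch}: new best, would save weights here")
    print("learning rate at step 500 of 1000:", lr_at_step(500, 1000))
```

The nice thing about keeping the "is this the best so far?" logic separate is that the predict step only ever has to load one file: whatever was saved at `best_epoch`.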
Writing helps me remember, so I wrote out an abbreviated version of the competition explanation and rules to make sure that I was understanding them.
Then I sought to understand the information conveyed by the metadata (this is recorded in the columns of train.csv) as well as the target (the thing the model would be trying to predict). I won't share my notes on this, simply because I just wrote out what was on the Kaggle website by hand to get it into my brain. I felt it was important to note that the target for a given segment of an EEG/spectrogram is a set of probabilities, one per category, reflecting the share of votes each category received, with the probabilities across the categories summing to 1 for that segment.
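To make the target concrete for myself, here's a small sketch of turning raw vote counts into probabilities that sum to 1, plus a first stab at the "check distributions of votes" EDA idea. The rows below are made-up toy values, not real competition data, and the column names in `VOTE_COLS` are my assumption about how the vote columns in train.csv are named, so check them against the actual file.

```python
import csv
import io
from collections import Counter

# Assumed names of the vote columns; verify against the real train.csv.
VOTE_COLS = [
    "seizure_vote", "lpd_vote", "gpd_vote",
    "lrda_vote", "grda_vote", "other_vote",
]

# Made-up toy rows standing in for train.csv.
TOY_CSV = """eeg_id,seizure_vote,lpd_vote,gpd_vote,lrda_vote,grda_vote,other_vote
1,3,0,0,0,0,0
2,0,2,0,0,0,2
3,1,0,0,0,0,4
"""


def votes_to_probs(row):
    """Divide each vote count by the row total so the result sums to 1.

    Assumes every row has at least one vote.
    """
    counts = [int(row[col]) for col in VOTE_COLS]
    total = sum(counts)
    return [count / total for count in counts]


rows = list(csv.DictReader(io.StringIO(TOY_CSV)))
targets = [votes_to_probs(row) for row in rows]

# Quick EDA: which category gets the most votes in each row?
# (Ties go to whichever category comes first in VOTE_COLS.)
top_class = Counter(
    VOTE_COLS[max(range(len(probs)), key=probs.__getitem__)]
    for probs in targets
)

for row, probs in zip(rows, targets):
    print(row["eeg_id"], [round(p, 2) for p in probs])
print(top_class)
```

With the real file I'd probably reach for pandas instead, but writing it out by hand like this made it click that the model is predicting a distribution over categories, not a single label.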
Next I wanted to get a better idea of the model being used in the sample notebook I chose to review, which was the EfficientNet2 model. Follow my blog to get notified of my next article, which will be on that architecture! In general, this series is meant to be a chance for you to follow along with my Kaggling experience and read about my thoughts, impressions, and yes, even frustrations with the process. I hope you enjoy it, and thanks for reading!