Ryan Gaspar is the Senior Machine Learning Engineer (MLE) at Rune Labs, where he provides machine learning support to engineers and data scientists both inside and outside the company. He worked as an Electrical Engineer before moving fully into software and AI as a Data Scientist/MLE. His experience spans large and small datasets as well as machine learning and deep learning applications for Natural Language Processing (NLP) and Computer Vision (CV).
Managing and working with large datasets is a difficult challenge researchers face when analyzing data. Those working locally are confined to a couple of options: downloading the entire dataset to their machine, or working with a small subset of the data.
An alternative is to take advantage of cloud services such as Amazon Web Services (AWS). At Rune Labs, we have built our infrastructure to support batch processing as a way to parallelize analyses efficiently across an entire dataset. This allows researchers to take full advantage of the entire dataset and draw better conclusions from the data.
Downloading the entire dataset locally allows researchers to work with the full set of data and keep analyses as accurate as possible, but it comes with challenges. The download alone can take anywhere from several hours to several days, which delays project plans, and each analysis step in the workflow can take several hours as well.
Working with a small subset of the data comes with its own challenges. Though you get results much faster than with the entire dataset, you need a deep understanding of the complete dataset to choose a subset that represents it well. Even then, an analysis of a subset may not fully reflect the entire dataset, especially when building models.
Batch processing allows data scientists and researchers to perform their data analyses at scale. For our use case, we chose AWS Batch, which parallelizes data analyses. Instead of running a single step of a researcher’s workflow sequentially over the entire dataset, the dataset is broken up into subsets (or batches), and the same analysis runs on each subset in parallel, as in the sketch below.
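To make the idea concrete, here is a minimal sketch of how an analysis could be submitted as an AWS Batch array job with boto3, where each child job handles one subset. The queue name, job definition, and subset count are hypothetical placeholders, not our actual infrastructure.

```python
# Minimal sketch: submit an AWS Batch array job so the same analysis runs
# on each subset in parallel. Queue, job definition, and subset count are
# hypothetical placeholders.
import boto3

batch = boto3.client("batch")

NUM_SUBSETS = 5  # number of batches the dataset is split into (hypothetical)

response = batch.submit_job(
    jobName="psd-analysis",
    jobQueue="research-analysis-queue",     # hypothetical job queue
    jobDefinition="psd-analysis-job:1",     # hypothetical job definition
    arrayProperties={"size": NUM_SUBSETS},  # one child job per subset
)
print("Submitted array job:", response["jobId"])
```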
In addition to parallel processing, AWS cloud services give us more compute power, shorter compute times, and stronger security without putting any load on our local machines.
Imagine a step in a researcher’s workflow that requires computing the power spectral density (PSD). What does this process look like with and without batch processing?
Without batch processing, we have to download the entire dataset locally and run the analysis row by row. With batch processing, the dataset is broken up into subsets and the PSD computation runs on each subset simultaneously, which makes processing much faster.
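As a rough illustration of the per-subset work, the sketch below computes a PSD with Welch's method from SciPy. The file layout, column name, and sampling rate are hypothetical; they stand in for whatever format the dataset actually uses.

```python
# Minimal sketch of the per-subset analysis step: compute a power spectral
# density (PSD) with Welch's method. File layout, column name, and sampling
# rate are hypothetical placeholders.
import pandas as pd
from scipy.signal import welch

FS = 250.0  # sampling frequency in Hz (hypothetical)

def compute_psd_for_subset(path: str) -> pd.DataFrame:
    """Load one subset of time-series data and compute its PSD."""
    subset = pd.read_parquet(path)        # one batch of the dataset
    signal = subset["signal"].to_numpy()  # hypothetical column name
    freqs, psd = welch(signal, fs=FS, nperseg=1024)
    return pd.DataFrame({"frequency_hz": freqs, "psd": psd})

if __name__ == "__main__":
    print(compute_psd_for_subset("subset_0.parquet").head())
```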
In the image above, four of the five parallel child jobs have completed successfully, and the fifth has started and is working toward completion.
There are a few reasons why batch processing is an effective part of a researcher’s data exploration and analysis workflow:
This case study walks you through an example of computing PSD.
Note: Steps 3 and 4 run in the background and are fully automated; they do not require the user to monitor them.
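For context on how each child job knows which subset to work on: AWS Batch exposes the child's index through the AWS_BATCH_JOB_ARRAY_INDEX environment variable, so a container entrypoint along these lines could pick out its slice of the dataset. The bucket, key pattern, and reuse of the compute_psd_for_subset helper from the earlier sketch are all hypothetical.

```python
# Sketch of a child job's entrypoint: each parallel child reads its array
# index from the environment and processes only its own subset. The S3
# bucket and key pattern are hypothetical placeholders.
import os

import boto3

s3 = boto3.client("s3")
BUCKET = "example-research-data"  # hypothetical bucket

def main() -> None:
    # AWS Batch sets this variable for each child of an array job.
    index = int(os.environ["AWS_BATCH_JOB_ARRAY_INDEX"])
    key = f"dataset/subset_{index}.parquet"  # hypothetical key pattern

    local_path = f"/tmp/subset_{index}.parquet"
    s3.download_file(BUCKET, key, local_path)

    # Run the same analysis on this subset, e.g. the PSD computation sketched above.
    result = compute_psd_for_subset(local_path)  # helper from the earlier sketch
    result.to_parquet(f"/tmp/psd_{index}.parquet")

if __name__ == "__main__":
    main()
```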
In addition, we have built version control into the workflow by taking advantage of the JobID that AWS Batch assigns to each job. Results can be organized by this ID, giving us a history of every run.
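As an illustration of that idea (the bucket name and result layout are made up), a job could write its output under an S3 prefix keyed by the job ID that AWS Batch exposes through the AWS_BATCH_JOB_ID environment variable:

```python
# Sketch of versioning results by AWS Batch job ID. Bucket name and result
# layout are hypothetical placeholders.
import os

import boto3

s3 = boto3.client("s3")
BUCKET = "example-analysis-results"  # hypothetical bucket

def upload_versioned_result(local_path: str) -> str:
    """Upload a result file under a prefix keyed by the Batch job ID."""
    job_id = os.environ.get("AWS_BATCH_JOB_ID", "local-run")
    key = f"psd-results/{job_id}/{os.path.basename(local_path)}"
    s3.upload_file(local_path, BUCKET, key)
    return key

# Each run's results land under their own job-ID prefix, giving a
# browsable history of analyses, e.g.:
# upload_versioned_result("/tmp/psd_0.parquet")
```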
At Rune Labs, it is our goal to improve treatment options and outcomes for people with Parkinson’s disease. We enable clinicians and researchers to develop and deliver precision neurology by making brain data useful at scale. Our team uses engineering approaches like batch processing not only to expedite analysis time, but also to help us discover and deliver results for our partners.
Our team of engineers and scientists is looking forward to meeting you.