In support of a colleague’s data analysis, I built a live, online data flow
application in Elixir to ingest large quantities (hundreds of gigabytes) of social
media data, process and filter it, and load it into a separate database for
direct querying. We used an iterative, agile approach to develop the data
processing and filtering techniques. I chose Elixir for its easy parallelism and
functional nature.
In my last post, I covered my first attempt to implement TCP streaming in
Flow, a data flow library for Elixir. My first attempts involved a series of
failed experiments with Unix sockets, and a GenStage implementation that failed
for reasons I didn’t understand at the time. I eventually settled on an approach
that worked.
To back up a bit: as part of a new and exciting project, I was faced with the task of ingesting a large amount of more or less homogeneous JSON data into a SQL database, so that an associate of mine could do some rudimentary business intelligence analysis on it. The context complicated things: the bulk data was historical social media data, and in the future he would also want to ingest the live API in addition to this archived historical data.
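
To give a sense of what that ingest–filter–load shape looks like in Flow, here is a minimal sketch over newline-delimited JSON. This is illustrative rather than the project’s actual code: the `Jason` decoder, the `relevant?/1` predicate, and the printed “load” step are stand-ins for whatever a real pipeline would use.

```elixir
defmodule IngestSketch do
  # Rough sketch of the ingest -> filter -> load shape described above,
  # assuming newline-delimited JSON on disk and the Jason library for decoding.
  def run(path) do
    path
    |> File.stream!()                             # lazily stream lines from disk
    |> Flow.from_enumerable()                     # fan work out across stages
    |> Flow.map(&Jason.decode!/1)                 # parse each JSON line
    |> Flow.filter(&relevant?/1)                  # drop records we don't care about
    |> Flow.partition()                           # route records to reducer stages
    |> Flow.reduce(fn -> [] end, fn record, batch -> [record | batch] end)
    |> Flow.on_trigger(fn batch ->
      # A real pipeline would bulk-insert `batch` into the database here.
      IO.puts("loading batch of #{length(batch)} records")
      {[], []}                                    # emit nothing, reset the batch
    end)
    |> Flow.run()
  end

  # Stand-in filter: keep only records that actually have a "text" field.
  defp relevant?(record), do: Map.has_key?(record, "text")
end
```

In the real pipeline the printing step would be an actual database insert, and the file source would be swapped for something that can also consume the live API.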