This year marks the 10th anniversary of Dataminr’s founding. It’s incredible to think just how far our AI platform has come: Today, we integrate more than 150,000 public data sources spanning social media, web forums, blogs, local media, the deep/dark web, and public sensor data. Our AI platform processes and cross-correlates an expanding variety of data types and formats, including text in multiple languages, images, video, sound, and streaming machine data from sensors. In 2019, we grew the number of data sources in our platform at our fastest rate yet, launched sound-based signals and deep/dark web signals, and dramatically expanded our work in multi-modal event detection.
This rapid expansion of our AI platform into new formats, fields, and datasets is possible because of what our platform has learned over the past decade. Dataminr’s work with public Twitter data paved the way for our success in integrating more data sources, and I wanted to provide a deeper look into how we’ve worked with this unique data source.
Dataminr’s exploration of public Twitter data has been an integral component of our research and learning since 2009. When Dataminr was founded, Twitter was three years old and roughly 1 billion tweets had been published in total on the platform. Today, hundreds of millions of tweets are published every day. The scale, global adoption, and societal impact of Twitter have skyrocketed since Dataminr was founded. From our early years, when public Twitter data was our first data source, until today, when it is one of thousands, public Twitter data has remained full of unique, multi-dimensional value for detecting real-time events.
Dataminr processes every public tweet in real-time. When a public tweet is published, Dataminr’s AI platform receives that tweet instantaneously as a real-time input. Along with the text of the public tweet, Dataminr ingests a variety of public data fields attached to each tweet, totaling billions of real-time signals a day. Stated another way, Dataminr’s platform processes tens of thousands of signals every second generated from Twitter alone.
Dataminr runs its first AI models on each of these billions of daily inputs in 7.8 milliseconds. We use a broad and diverse spectrum of AI models to score, rank, filter, and cluster tweets, and identify, classify, and summarize events described within tweets. These AI models span supervised, semi-supervised and unsupervised learning. They extensively use neural networks, ranging from convolutional neural networks to recurrent neural networks, including long short-term memory networks. Our AI algorithms applied to tweets integrate a range of AI methods from several scientific fields, including natural language understanding, computer vision, and natural language generation.
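The score, filter, and cluster stages described above can be sketched at a very high level. The snippet below is a minimal illustrative toy, not Dataminr’s actual pipeline: the keyword-based scorer and the greedy Jaccard clusterer are invented stand-ins for the neural models the text describes, chosen only to make the pipeline shape concrete.

```python
# Toy score -> filter -> cluster pipeline (illustrative only; real systems
# use trained neural models in place of these hand-written heuristics).
from dataclasses import dataclass

@dataclass
class Tweet:
    text: str

# Hypothetical "model": score tweets by presence of event-indicative terms.
EVENT_TERMS = {"explosion", "fire", "earthquake", "evacuated", "crash"}

def score(tweet: Tweet) -> float:
    tokens = set(tweet.text.lower().split())
    return len(tokens & EVENT_TERMS) / max(len(tokens), 1)

def jaccard(a: set, b: set) -> float:
    return len(a & b) / len(a | b) if a | b else 0.0

def cluster(tweets: list, threshold: float = 0.3) -> list:
    """Greedy single-pass clustering on token overlap."""
    clusters = []  # each cluster: (representative token set, member list)
    for t in tweets:
        tokens = set(t.text.lower().split())
        for rep, members in clusters:
            if jaccard(tokens, rep) >= threshold:
                members.append(t)
                break
        else:
            clusters.append((tokens, [t]))
    return clusters

tweets = [Tweet("fire reported downtown building evacuated"),
          Tweet("building evacuated after fire downtown"),
          Tweet("great coffee this morning")]
kept = [t for t in tweets if score(t) > 0.1]  # filter low-scoring tweets
groups = cluster(kept)
print(len(groups))  # the two fire tweets merge into one cluster
```

Clustering tweets that describe the same incident is what lets many individually weak signals combine into a single confident event detection.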
Today, as Dataminr’s EVP of Product and Engineering, I oversee a growing team of more than 115 AI experts, engineers, and researchers who build, train and continuously update Dataminr’s ML and AI models. One of the things I find especially interesting is how my team continues to discover new historical patterns in public data that can train new AI models for more effective detection of real-time events in the present.
When one thinks of the impact of a large global data source like Twitter, it’s easy to focus just on Twitter’s real-time qualities and on the impact of individual tweets. But that is simply the tip of the iceberg: historical public data holds key patterns as well.
For starters, Dataminr has a record of the full set of events we’ve detected from public Twitter data (i.e., the annotated alerts on breaking events previously delivered to clients generated by Twitter data). These alerts, which were created by AI models and by our team of professional domain experts operating in an online human-AI feedback loop, provide a highly accurate historical labeled dataset of events. Dataminr’s systems and data scientists leverage this uniquely valuable data asset to iteratively train and optimize the performance of numerous real-time event detection algorithms.
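The feedback loop described above can be sketched in miniature: confirmed alerts accumulate in an archive of labeled examples, and models are periodically retrained against it. The class and data below are invented for illustration; Dataminr’s actual models are neural networks, not this toy keyword counter.

```python
# Illustrative sketch of retraining from an archive of labeled alerts
# (hypothetical toy classifier; not Dataminr's actual models).
from collections import Counter

class KeywordModel:
    """Toy event classifier refreshed from the labeled alert archive."""
    def __init__(self):
        self.event_terms = Counter()
        self.background_terms = Counter()

    def train(self, labeled_alerts):
        # labeled_alerts: (text, is_event) pairs from the human-AI feedback loop
        for text, is_event in labeled_alerts:
            target = self.event_terms if is_event else self.background_terms
            target.update(text.lower().split())

    def predict(self, text: str) -> bool:
        tokens = text.lower().split()
        ev = sum(self.event_terms[w] for w in tokens)
        bg = sum(self.background_terms[w] for w in tokens)
        return ev > bg

# Hypothetical archive entries: text plus the analyst/model-confirmed label.
archive = [("explosion reported near station", True),
           ("huge explosion downtown", True),
           ("my lunch was great", False)]
model = KeywordModel()
model.train(archive)  # periodic retrain from the growing archive
print(model.predict("explosion near the station"))
```

The essential point is the loop itself: each new confirmed alert enlarges the labeled dataset, which improves the next generation of models.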
From a 100,000-foot view, public Twitter data contains the “fossils” of the historical digital landscape: a record of events that were tweeted about across the globe. This record reveals not just the events themselves, but also the patterns around them: how each event first surfaced on Twitter, the clusters of initial tweets describing it, a proxy for the attention curve it received from local, national, and global audiences, the “sub-events” that occurred as it transpired, and the sequence of related events that may have followed. These unique data slices from the Twitter aggregate have proven to be valuable inputs to AI models that run today on both public tweets and many other data sources, and they have expanded the scope and range of breaking real-time events we detect for our clients.
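One of the patterns mentioned above, the attention curve, can be reconstructed very simply: bucket the timestamps of an event’s tweets into fixed intervals and count. The function below is a hypothetical sketch of that idea, not a Dataminr implementation.

```python
# Illustrative sketch: reconstruct an event's "attention curve" by
# bucketing tweet timestamps into per-minute counts.
from collections import Counter
from datetime import datetime

def attention_curve(timestamps, bucket_seconds=60):
    """Map each timestamp to a bucket index relative to the first tweet."""
    start = min(timestamps)
    return Counter(int((t - start).total_seconds() // bucket_seconds)
                   for t in timestamps)

# Hypothetical tweet timestamps for one event.
ts = [datetime(2019, 1, 1, 12, 0, 5),
      datetime(2019, 1, 1, 12, 0, 40),
      datetime(2019, 1, 1, 12, 1, 10),
      datetime(2019, 1, 1, 12, 3, 0)]
curve = attention_curve(ts)
print(sorted(curve.items()))  # [(0, 2), (1, 1), (2, 1)]
```

The shape of such a curve (how fast attention ramps, plateaus, and decays) is exactly the kind of historical pattern that can be learned from and matched against new events.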
Historical public Twitter data is also filled with unique patterns that arise from Twitter’s role as a digital platform through which the news cycle flows. It contains a comprehensive record of the news stories covered by most local, national, and global media outlets, an unmatched repository of memes that went viral, and the digital signatures of the reach and impact of those stories and memes. These multi-faceted data building blocks have been integrated into a number of our AI models for detecting events in real time. They range from multi-variable real-time online learning models that predict the possible “newsworthiness,” “novelty,” and “pre-news” probabilities of newly published public tweets, to algorithms that measure the velocity of unexpected digital attention shifts suggesting pre-viral content, to models that plot key digital “nodes” across the dynamically shifting digital topography and can pinpoint propagation patterns and “node hops” that suggest potential meme virality.
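The "velocity of attention shifts" idea can be made concrete with simple finite differences over a per-interval mention-count series: a rising first difference is velocity, a rising second difference is acceleration. The thresholds and function names below are invented for illustration; production virality models are far richer.

```python
# Illustrative sketch: flag potentially pre-viral content from the
# velocity and acceleration of per-interval mention counts.
def velocity(counts):
    """First difference of a per-interval mention-count series."""
    return [b - a for a, b in zip(counts, counts[1:])]

def is_pre_viral(counts, v_min=5, a_min=2):
    """Sustained, accelerating growth suggests pre-viral spread.

    v_min and a_min are hypothetical thresholds chosen for this example.
    """
    v = velocity(counts)
    a = velocity(v)  # second difference ~ acceleration
    return bool(v) and v[-1] >= v_min and bool(a) and a[-1] >= a_min

steady  = [10, 11, 10, 12, 11]   # flat background chatter
surging = [2, 4, 9, 20, 45]      # accelerating growth
print(is_pre_viral(steady), is_pre_viral(surging))  # False True
```

Detecting the acceleration rather than the absolute count is what makes it possible to surface content before it goes viral, not after.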
Public Twitter data also serves as one of the most comprehensive global proxies for collective digital expression. It demonstrates the evolution of digital language, vocabulary, and syntax, in other words, the nature of drift in human conversation: how and when new words, phrases, and expression structures entered the digital lexicon, how the meanings of these linguistic features changed over time, and how topics, concepts, and meanings expanded, shifted, and intersected in global digital expression.
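One simple way to quantify this kind of semantic drift is to compare the words that co-occur with a term in two different time periods. The sketch below uses raw co-occurrence neighbors and Jaccard overlap; it is a hypothetical illustration (real drift models typically use learned embeddings), and the corpora are invented.

```python
# Illustrative sketch: measure drift in how a word is used by comparing
# its co-occurrence neighbors across two time periods.
from collections import Counter

def neighbors(corpus, word, top=3):
    """Most frequent words appearing in the same message as `word`."""
    counts = Counter()
    for msg in corpus:
        tokens = msg.lower().split()
        if word in tokens:
            counts.update(t for t in tokens if t != word)
    return {w for w, _ in counts.most_common(top)}

def drift(corpus_then, corpus_now, word):
    """1 - Jaccard overlap of neighbor sets: 0 = stable usage, 1 = full drift."""
    a, b = neighbors(corpus_then, word), neighbors(corpus_now, word)
    union = a | b
    return 1 - len(a & b) / len(union) if union else 0.0

# Toy corpora: "viral" shifts from epidemiology to internet culture.
then = ["viral infection spreading", "viral outbreak in the region"]
now  = ["that video went viral overnight", "viral meme everywhere today"]
print(drift(then, now, "viral"))
```

Tracking when a word’s neighborhood shifts like this helps event-detection models keep up with evolving vocabulary instead of silently going stale.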
These unique data trends serve as critical inputs for Dataminr’s real-time event detection models, including contextual event understanding models, adaptive unsupervised learning-based alert topic classification systems, conditional random field neural network NER models, event location prediction models, and even our natural language generation models. (Stay tuned for an upcoming blog post on how we have expanded our use of natural language generation and transformer neural networks in the process of creating alert headlines.)
The four areas outlined above represent just a small set of the unique historical patterns in public Twitter data that we’ve modeled and integrated into our AI approaches for discovering breaking events faster and more comprehensively. Today, when Dataminr integrates a new data source, we extract far more value, far more quickly, because of how much our AI has learned (and continues to learn) from the past. Our AI platform’s unmatched experience over the last 10 years has triggered a dataset-network effect that is accelerating the value of Dataminr’s products and signals exponentially.
Public Twitter data represents just a fraction of the public data that our team gets to explore every day to create novel approaches and models for real-time event detection. Would you like to work on extremely complex data challenges, while also having global impact and making the world a safer place? Dataminr is rapidly expanding our team of AI experts, engineers, and researchers. Look here to see if one of our open jobs is right for you!