As promised in our initial Bleeding Edge post, we are beyond excited to interview Paul Siegel of Brandwatch. He is a former co-worker of mine, someone I can personally say made mathematical theory interesting to me at a level I had never experienced before, and the person I can blame for introducing me to Black Mirror.
Meet Paul Siegel – Principal Data Scientist @ Brandwatch
Paul, thank you again for taking the time to be interviewed about Innovation & the bleeding edge.
First of all, what is your background?
I began my career specializing in pure mathematics (operator algebras and topology), culminating in a three-year stint as an assistant professor of mathematics at Columbia University. While at Columbia I became interested in data science, particularly due to connections between my academic research on the large-scale geometry of graphs and the dynamics of social networks. This led me naturally to social media data, and consequently I found a comfortable home at Brandwatch nearly five years ago. Since then I have branched out into other areas like NLP and time series analysis.
What is the bleeding edge with regard to data?
Natural language processing experienced a breakthrough in late 2018, and it triggered a flurry of activity which I expect to continue through 2020. The breakthrough was a new mathematical model of language called a “transformer,” which allows researchers and practitioners to construct sophisticated representations of language from huge corpora of documents – news, blogs, comment threads, etc. These representations can generate eerily human-like language with little human supervision, and perhaps more importantly they can be tuned to solve all sorts of more specific problems, like answering SAT-style questions based on a paragraph or characterizing author intent.
This strategy for building language analysis models is proving more effective than traditional machine learning models designed from the ground up to solve one specific problem. Pretrained transformers also make it easier to create and maintain a suite of language analysis tools, since many different tools can be built on the same underlying language model. This is why there has been an ongoing arms race throughout 2019 and continuing into 2020 to build bigger, faster, and more powerful language models – Google, Facebook, OpenAI, and now Microsoft (just a few weeks ago!) have publicly released models with millions or billions of parameters, and we’re going to see a lot of exciting progress over the next few years as the community discovers new ways to apply them.
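To make the “one pretrained model, many tools” idea concrete, here is a minimal sketch using the open-source Hugging Face transformers library. The library choice is our illustration (Paul doesn’t name a specific toolkit); both tasks below are served by pipelines built on the same family of pretrained transformers:

```python
# A minimal sketch of building multiple language tools on pretrained
# transformers, using the open-source Hugging Face `transformers` library.
from transformers import pipeline

# Sentiment analysis served by a pretrained transformer.
sentiment = pipeline("sentiment-analysis")
print(sentiment("The launch event exceeded all of our expectations."))
# e.g. [{'label': 'POSITIVE', 'score': 0.99...}]

# Extractive question answering drawn from the same family of pretrained models.
qa = pipeline("question-answering")
print(qa(
    question="Where was Paul an assistant professor?",
    context="Paul Siegel spent three years as an assistant professor "
            "of mathematics at Columbia University.",
))
# e.g. {'answer': 'Columbia University', ...}
```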
What new dataset intrigues you the most?
There is a ton of great data out there, and more and more of it is free and publicly available. What gets me excited is not one single new dataset but the opportunities that become available when we play one dataset against another, or use several to collaboratively solve a new problem. Example: combine historical travel datasets with historical data on the spread of disease, and forecast the progression of the coronavirus. Or crawl industry conference websites and quarterly earnings reports to predict partnership or M&A activity. Between Kaggle, r/datasets on Reddit, Google’s dataset search engine, FiveThirtyEight, ProPublica, and so on, there are thousands of interesting projects and opportunities waiting in the wings.
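As a toy illustration of playing one dataset against another, here is a sketch of the travel-plus-disease example in pandas. The file names and columns are hypothetical stand-ins, not real data sources:

```python
# A toy sketch of joining hypothetical travel records with hypothetical
# disease-case counts, the kind of merge a coronavirus-spread forecast
# might start from. File names and columns are illustrative assumptions.
import pandas as pd

travel = pd.read_csv("air_travel.csv")     # date, origin, destination, passengers
cases = pd.read_csv("reported_cases.csv")  # date, city, new_cases

# Attach case counts at the origin city to each travel record.
merged = travel.merge(
    cases,
    left_on=["date", "origin"],
    right_on=["date", "city"],
    how="left",
)

# Crude proxy for exported risk: outbound passengers weighted by local cases.
merged["exported_risk"] = merged["passengers"] * merged["new_cases"].fillna(0)
print(merged.groupby("destination")["exported_risk"].sum().nlargest(5))
```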
One last remark. Algorithmic bias is a huge and growing problem that we aren’t doing enough to address, and the first step is creating great datasets to hold models accountable. Does your sentiment classifier rate neutral documents containing the word “Muslim” as negative? Does your HR recruiting algorithm systematically predict that female candidates are less likely to be successful in your company than male candidates? Do you even know? Problems like these occur because mathematical models don’t reason – they extrapolate from whatever patterns they are trained on. These biases can be corrected by exposing the models to datasets which punish them for placing too much weight on the wrong signals, but there aren’t a lot of off-the-shelf datasets out there that do this. Creating such datasets would be a very delicate undertaking, but it would be enormously valuable.
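As a sketch of what such an accountability probe could look like in practice, here is a tiny harness that feeds a sentiment classifier neutral template sentences differing only in an identity term and flags asymmetric scores. The classifier interface, the templates, and the tolerance threshold are all assumptions for illustration:

```python
# Probe a sentiment classifier with neutral template sentences that differ
# only in an identity term, and flag asymmetric scores. `classifier` is a
# placeholder for any callable returning a score in [-1, 1]; templates,
# terms, and the tolerance threshold are illustrative assumptions.
TEMPLATES = [
    "My neighbor is {}.",
    "I had lunch with a {} colleague today.",
]
IDENTITY_TERMS = ["Muslim", "Christian", "Jewish", "atheist"]

def probe_bias(classifier, templates=TEMPLATES, terms=IDENTITY_TERMS, tolerance=0.1):
    """Return (template, term, score, group mean) rows where a neutral
    sentence is scored differently based only on the identity term."""
    flagged = []
    for template in templates:
        scores = {term: classifier(template.format(term)) for term in terms}
        mean = sum(scores.values()) / len(scores)
        for term, score in scores.items():
            if abs(score - mean) > tolerance:
                flagged.append((template, term, score, mean))
    return flagged
```

Every row returned by `probe_bias` is a neutral sentence the model scores differently based only on who it is about, exactly the kind of signal an accountability dataset should punish.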
What findings could you derive from it?
By making different datasets collaborate with one another, we will discover connections between industries and disciplines that we never saw before. It will also hold us accountable: maybe you think your platform is great at modeling some specific dataset, but if the signals you discover don’t matter in an adjacent domain, that should be cause for concern.
Constructing datasets that target algorithmic bias will hold us accountable in a different way – it will help us ensure that the goal of scientific and data-driven decision making actually helps make the world better, rather than just replicating the same old problems. It will also cut through the toxic fetishization of artificial intelligence by reminding us that we are responsible for the sophisticated new tools that we are using and by forcing us to understand them better.
Paul, thank you most sincerely for taking the time to speak with us today. We look forward to hearing how you achieve the future you’ve laid out here and continue leading the charge on innovative approaches to handling social data.
In our next post you’ll get to hear insights from Matt Murphy of Beanstalk Predictive.