We work with a lot of different API and data providers. While we want to keep adding new APIs and data sources to the Streamdata.io API Gallery, it is important that we spend time getting to know each API provider and understanding the value and source of their data. For many common types of data, you can find multiple APIs that provide access to the same data, leaving us on a quest to find the true source. The history and provenance of data are important to us, and in our minds data is more valuable when you can trace it back to its source.
There is a lot of scraped, purchased, and questionably acquired data and content on the market. This affects the overall quality and value of the data, and it is not something you always want to be consuming and integrating into your own databases. We are happy to profile multiple sources of the same data, because not all API and data providers are created equal, but in every case we want to understand as much as possible about where the data comes from, as well as the aggregation and enrichment it undergoes. Some API and data providers are good at providing this provenance as part of their services, while others make it much more difficult to understand where their data originates.
In the last decade of working with data online, provenance hasn’t always been important. However, in a landscape troubled by GDPR, security, and privacy concerns, data provenance will become more important. This doesn’t end with raw data. It will also matter for the machine learning models that use data as part of their training processes. Data is valuable, but the big data landscape is also getting very competitive, and this will drive down the value of more commonly available data and drive up the price of higher quality data that has a pedigree and includes provenance. This makes provenance an important differentiator for API and data providers to begin thinking about and baking into their data and API product and service offerings.
We are beginning to think more about data provenance, and to investigate it more closely when we profile our APIs. We’d like to begin including it as part of our overall rating system for API providers, helping us understand which APIs are the most relevant and usable. We are also beginning to look more at the regulatory requirements for companies that operate in highly regulated industries, and to translate some of what we find into something applicable to the wider data and API sectors. While it will take several years, we feel that data provenance will become a common discussion when it comes to the buying and selling of data, and making it accessible via APIs. It is something that will increasingly differentiate not just data and API providers, but also the consumers of the data that is available across the landscape.
Image Credits: Paul Askew