Importance is the most crucial aspect of the current data movement. Storing and aggregating all your data is great, but eventually you’ll run out of disk space, or out of the capacity to read and comprehend all those reports. You will always need to know what’s important in your data and how it impacts your organization.
Innovation in open source libraries is paving the way for a deeper understanding of what is important. Determining importance spans many dimensions: timeliness, personalization, past behavior, social impact, content, meaning, and whether something is actionable.
Despite all these challenges, we’ve made great strides in identifying important data, thanks to open source. Gains in search engine technology such as Apache Lucene/Solr have revolutionized our ability to handle multi-structured content at scale, rank it, and return results quickly. Search engines are no longer just fast keyword lookups; they have evolved to seamlessly collect, collate, and curate data across a wide variety of data types. Combined with large-scale data processing frameworks, it is now possible to build sophisticated solutions that ingest your data, model it, serve it to your users, and then learn from their behavior.
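The ranking idea at the heart of such search engines can be sketched in a few lines. The example below is a minimal, self-contained illustration (not Lucene/Solr code): it indexes a handful of toy documents and scores them against a query with simple TF-IDF weighting, the classic scheme that production engines refine heavily.

```python
import math
from collections import Counter

def build_index(docs):
    """Index (doc_id, text) pairs: per-document term frequencies
    plus document frequencies for IDF weighting."""
    tf = {doc_id: Counter(text.lower().split()) for doc_id, text in docs}
    df = Counter()
    for counts in tf.values():
        df.update(counts.keys())
    return tf, df, len(docs)

def search(query, tf, df, n_docs):
    """Rank documents by a simple TF-IDF score for the query terms."""
    terms = query.lower().split()
    scores = {}
    for doc_id, counts in tf.items():
        score = 0.0
        for t in terms:
            if t in counts:
                # Rare terms get higher weight than common ones.
                idf = math.log((1 + n_docs) / (1 + df[t]))
                score += counts[t] * idf
        if score > 0:
            scores[doc_id] = score
    return sorted(scores.items(), key=lambda kv: -kv[1])

docs = [
    ("a", "open source search engines rank documents"),
    ("b", "search engines collect and curate data"),
    ("c", "large scale data processing frameworks"),
]
tf, df, n = build_index(docs)
print(search("search engines", tf, df, n))
```

Real engines layer much more on top of this (analysis chains, field boosts, phrase and proximity scoring), but the core loop — weight terms, score documents, sort — is the same.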
Google, Amazon, Facebook, and others have been doing this for years, and the power of open source now puts these techniques within reach of the rest of us.
The end goal of all of this work is to create a virtuous cycle between your users and your data. The more your users interact with your system, the smarter your system gets. The smarter your system gets, the more your users will want to interact with it.