Data Management Series: What Matters Most in the Data Lifecycle for ML Applications

Today (tail end of 2019), machine learning adoption has moved past the early phase of the curve in most leading technology organizations across the globe. Benedict Evans has written a great article on the deployment of ML. Most machine learning today is still supervised learning and hence data dependent. Even with technological advancements like BERT, ever more data-hungry neural networks are still looking for more and better data.

In the consumer space, it feels like the data story is converging around Google and Facebook (though that is a different discussion and orthogonal to this topic). Here are some of my observations on the dynamics of the data space:
1. In the enterprise, data is more verticalized and industry specific.
2. There is a variety of attributes attached to data, such as privacy, accountability and transparency.
3. Despite all the different flavors of data, there are very few standards when it comes to data management.

This series of posts is my attempt to outline lessons learned during my tenure on Google research teams. They are mostly notes to myself, and for anyone who wants to know more about “Data for ML and data governance challenges”. Following is a brief outline for the series:

  • Data acquisition
    • Ethics in collecting data – User trust, data donation and product reciprocation, collection policies, GDPR.
    • Distribution and diversity of data – How market segmentation, product features, product strategy and growth investment affect data
    • Bias – How to build a data acquisition strategy that, to a certain extent, prevents data bias
    • Cost – How brand, opportunity and products play a key role in the cost of data collection
    • What to collect and what can be collected – Thoughts on feasibility vs. sensibility of collecting data
  • Data storage
    • Retention policies – Where and how to store data, and access control policies
    • Shelf life – How to determine a valid lifespan for data and possible ways to handle this in data-agnostic systems
  • Data usage
    • Data discovery and mechanisms
    • Data visualization techniques and common requirements
    • Curation / cleaning / noise / anonymizing
    • Labeling and annotation for training and validation 
    • Data and annotation analysis 
  • Open sourcing / publication
    • Publishing datasets with research papers, organization policies and the competitive perspective on releasing datasets
    • Discovery of published datasets and attracting contributions
    • Data as a marketing tool, setting up competitions and accepting contributions

Lastly, I’ll try to finish this series with my thoughts on building generic and verticalized B2B data businesses like Scale.com.
