Part 1: Data acquisition principles, cost, and more insights

In this first blog post (of a series), I will cover key principles involved in sourcing data, primarily for training and validating machine learning models. This article focuses on “what” to do rather than “how” to do it. The points below are ordered by workflow rather than importance.

Photo by Markus Spiske on Unsplash

Understanding data goals and purpose

True for almost any task. In my observation, this critical step is usually an afterthought or a neglected activity in data collection efforts. Framing from the basic principles of ML, a model is only as good as its data and training algorithm. Defining the output and quality metrics for the model will drive data needs. Thoughtful tolerance (or allowance) of bias in model output is the essence of understanding training data goals. The initial sections of Building Intelligent Systems by Geoff Hulten are a great read to brush up on some of these fundamentals in depth.

Awareness of data consumers and target use case

Understanding the target consumers when gearing up for data acquisition efforts is a key aspect of successful data sourcing. Researchers, data scientists, and developers are a few of them, and different roles have different data needs. For example, researchers are interested in data around their hypotheses or experiments; data scientists look for data that supports causal analysis; and for developers, the source of the data, its format, and ease of plumbing might be the most important.

Amount of data

Most data consumers in an organization will tell you that more data yields better model quality. Realistically, many teams face limits on the amount of data that can be collected; cost, quality, and availability are the primary constraints. A statistically significant data volume is key to making this decision. Lately, there has been a buzz around “data network effects” in association with ML-based products. The idea: the more users use your product, the more data they contribute, and the better the product gets, which attracts more customers. Many startups and incumbents want to build around this idea to manufacture a “winner takes all” effect. Here is a slightly contrarian view. As a side note, the industry is progressing toward achieving better or equal model quality with less data; unsupervised, active, and reinforcement learning techniques are finding their place in many industry use cases.
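How much data counts as “statistically significant” depends on the tolerance for error. As a rough illustration (this sketch and its parameter names are mine, not from any specific methodology in this post), the classic sample-size formula for estimating a proportion gives a useful floor:

```python
import math

def sample_size(margin_of_error=0.05, confidence_z=1.96, proportion=0.5):
    """Classic sample-size estimate for a proportion.

    proportion=0.5 is the worst case (maximum variance), so it gives a
    conservative floor when the true proportion is unknown.
    """
    n = (confidence_z ** 2) * proportion * (1 - proportion) / margin_of_error ** 2
    return math.ceil(n)

# 95% confidence (z = 1.96), +/-5% margin of error
print(sample_size())  # → 385
```

Tightening the margin of error to 3% pushes the requirement past a thousand samples, which is one concrete way to reason about the cost/volume trade-off before collection starts.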

Data Diversity

User behavioral and environmental context is extremely important to successful ML implementation. Most of us can recall the controversial Google Photos “gorilla” incident. For consumer features, a model that performs in an unbiased way for any user (irrespective of gender, race, and demographics) is the crux of “ML fairness”.

Diversity in data is a great way for technology-oriented products to achieve prediction fairness for every user. It is very tempting to source behavioral data from high-availability regions or sources, but such data is likely to lack diversity.
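A simple way to catch that skew early is to audit the observed share of each group against a target distribution before training. A minimal sketch (the field name `region` and the schema are hypothetical, purely for illustration):

```python
from collections import Counter

def diversity_report(samples, key, expected):
    """Compare observed group shares against expected shares.

    `samples` is a list of dicts; `key` names the grouping attribute.
    Returns per-group observed share, target share, and the gap.
    """
    counts = Counter(s[key] for s in samples)
    total = sum(counts.values())
    report = {}
    for group, target in expected.items():
        observed = counts.get(group, 0) / total
        report[group] = {"observed": round(observed, 3),
                         "target": target,
                         "gap": round(observed - target, 3)}
    return report

# A skewed sample: 70% from the most available source region
data = ([{"region": "NA"}] * 70 + [{"region": "EU"}] * 20
        + [{"region": "APAC"}] * 10)
print(diversity_report(data, "region", {"NA": 0.4, "EU": 0.3, "APAC": 0.3}))
```

Here the high-availability region is over-represented by 30 points, which flags the dataset for rebalancing or targeted collection before any model sees it.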

Ethics of collecting data

In data collection, consideration for users' privacy and data rights is both trust-building and rewarding to users. The following are a few important aspects:

  • User trust – Users' trust is a core aspect of brand building. Technically superior products can quickly lose users' trust by cutting corners on security (Zoom) or through frequent unreliability of core functionality (Apple Maps). Most users assume that the products they use work in a privacy-preserving way unless they have consented otherwise. It is imperative to communicate to users how and where their data will be used. Apple seems to be doing a good job in this respect.
  • Data donation – Users are occasionally offered opportunities to donate data, especially when a user's response to the output is classified as unexpected. Many products ask permission to upload logs and usage statistics. As a good data collection practice, it is important to explain, in a simple and understandable way, how a user's data donation is going to make the product better. GDPR compliance helps solve many of these issues to a certain extent.
  • Product reciprocation – Most recommender-category ML products ask users to contribute data to make the experience personalized. It is generally a good idea to surface a recommendation's reasoning along with the user-contributed data behind it. This is one of many ways to bring transparency to users' data usage.

Bias and variance with data collection

Human behavior is biased when it results in an erroneous judgement due to incomplete experience, lack of context, or an instinctive response. Most behavioral biases carry negative connotations but are mostly avoidable. Human judgement is a big factor in generating training data and in the quality perception of ML systems, which makes them susceptible to biased output. A couple of challenges in this context are worth keeping in mind with data collection:

  1. Data – Training data with an inappropriate distribution that is not reflective of the real world will train a biased model. One way to think of bias is how far the model's predictions are from the real world. A biased model will fail to represent the irregularities or patterns in the data accurately, or will over-index on some more than others. For example, a biased image-retrieval model will rank images higher simply because they resemble the densest regions of its training data.
  2. Model sensitivity – A good model needs to generalize well beyond its training data. If the model's predictions are too scattered, or the model is too sensitive to noise in the data, we have a high-variance model. In other words, it is overfitting.
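The two failure modes above can be seen in a toy pure-Python experiment (everything here is illustrative): a model that ignores the input entirely (high bias) versus one that memorizes the training set (high variance), evaluated on noisy data drawn from a simple linear relationship.

```python
import random

random.seed(0)

def make_data(n):
    # The "real world": y = 2x plus a little Gaussian noise
    xs = [i / n for i in range(n)]
    return [(x, 2 * x + random.gauss(0, 0.1)) for x in xs]

train, test = make_data(20), make_data(20)

def mse(model, data):
    return sum((model(x) - y) ** 2 for x, y in data) / len(data)

# High bias: ignore x entirely and always predict the training mean.
mean_y = sum(y for _, y in train) / len(train)
def biased(x):
    return mean_y

# High variance: memorize training points (nearest-neighbour lookup),
# reproducing every noise wiggle in the training set exactly.
def memorizer(x):
    return min(train, key=lambda p: abs(p[0] - x))[1]

print("biased    train/test MSE:", mse(biased, train), mse(biased, test))
print("memorizer train/test MSE:", mse(memorizer, train), mse(memorizer, test))
```

The biased model has similar (and high) error on both sets; the memorizer scores a perfect zero on its training data but a nonzero error on fresh samples, which is the train/test gap that signals overfitting.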

There are a variety of techniques to overcome these problems, and this paper covers some of them beautifully.

Cost of data collection and annotation

Data (collection) cost is a function of two factors: a) availability of data (producers), and b) annotation complexity.

Data collection:

  • For example, getting enough data for rare and remote languages is challenging and expensive.
  • If we employ a collection strategy through the product experience, fewer initial users will yield less data. This is the typical cold-start problem.

Annotation complexity and associated costs: There are two primary factors at play here:

  • Time taken to label one sample (most annotators are paid hourly) 
  • Desired accuracy of annotation – mostly achieved by increasing human consensus (more annotators per sample).
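The two factors above multiply together, which is why consensus requirements dominate budgets. A back-of-the-envelope sketch (the function and its parameter names are mine, purely illustrative):

```python
def annotation_cost(n_samples, seconds_per_label, hourly_rate,
                    annotators_per_sample=1):
    """Rough annotation budget: labelling time x hourly rate x consensus.

    `annotators_per_sample` models the extra labels needed to reach a
    desired consensus/accuracy level.
    """
    hours = n_samples * seconds_per_label * annotators_per_sample / 3600
    return round(hours * hourly_rate, 2)

# 100k samples at 30s each, $15/hour, single annotator
print(annotation_cost(100_000, 30, 15))                          # → 12500.0
# Same task with 3-way consensus triples the bill
print(annotation_cost(100_000, 30, 15, annotators_per_sample=3)) # → 37500.0
```

Running a few small pilots to pin down `seconds_per_label` before committing to the full dataset is usually much cheaper than discovering the real number mid-project.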

Researchers and data scientists employ a variety of techniques to optimize cost and accuracy across multiple pilots. Good human intelligence task (HIT) design is one of the core skills that not only improves data quality but can also help reduce cost.

Data shelf life considerations

Accuracy and validity are both vital for data used in model training, and not all accurate data is valid. The temporal nature of data in different domains imposes different challenges; stock market behavior data is one example, and conversational topic data is contextual, cultural, and trend-driven. The meta point here: the world is always changing, and human interaction and standards are always evolving. Predicting something from data that is no longer a representation of current reality will yield sub-optimal output. ML features that end up being dead weight in a product are a byproduct of lacking continuous learning of some sort.
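One lightweight guard against stale data is to attach a domain-specific shelf life and filter the training set by timestamp on each refresh. A minimal sketch, assuming a hypothetical `(timestamp, payload)` schema:

```python
from datetime import datetime, timedelta, timezone

def filter_fresh(samples, max_age_days):
    """Drop training samples older than a domain-specific shelf life.

    `samples` are (timestamp, payload) pairs with timezone-aware
    timestamps; `max_age_days` encodes how fast the domain drifts.
    """
    cutoff = datetime.now(timezone.utc) - timedelta(days=max_age_days)
    return [(ts, x) for ts, x in samples if ts >= cutoff]

now = datetime.now(timezone.utc)
data = [(now - timedelta(days=400), "old topic"),
        (now - timedelta(days=10), "current topic")]
# A 90-day shelf life keeps only the recent sample
print(filter_fresh(data, max_age_days=90))
```

A fast-drifting domain (trending conversation topics) might warrant a shelf life of weeks, while slower-moving domains can keep data for years; the point is to make that choice explicitly rather than letting stale samples accumulate.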

I hope this post highlighted some different perspectives on data collection methodologies.