
Continuing from the part 1 blog post: for most machine learning tasks, a dataset is needed to train a model and test its output. Arguably, preparing the right datasets is the most critical step in getting the right output, and sourcing and collecting data is one of the first steps in preparing them. Here are a few ideas and industry practices on how to acquire data.
Crowdsourcing –
Crowdsourcing is a popular way of sourcing data, mainly through questions, games, purpose-built apps, and surveys. Users contributing data are usually aware of the purpose of their contribution and are often paid for it. This method is particularly interesting for researchers, since data collection can be targeted with carefully drafted experiments around a hypothesis. Tools like Google Crowdsource and Amazon's Mechanical Turk are a few examples in this category.
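To make this concrete, here is a minimal sketch of posting a paid data-collection task through Mechanical Turk's requester API via boto3. The question content, reward amount, and use of the sandbox endpoint are illustrative assumptions, not a production setup.

```python
# Minimal sketch: posting a crowdsourcing task (HIT) on Amazon Mechanical Turk.
# Assumes AWS credentials are configured; the endpoint below targets the MTurk
# sandbox, so no real payments are made. The question and reward are
# illustrative placeholders.
import boto3

mturk = boto3.client(
    "mturk",
    region_name="us-east-1",
    endpoint_url="https://mturk-requester-sandbox.us-east-1.amazonaws.com",
)

# A QuestionForm asking workers for a single free-text contribution.
question_xml = """<QuestionForm xmlns="http://mechanicalturk.amazonaws.com/AWSMechanicalTurkDataSchemas/2005-10-01/QuestionForm.xsd">
  <Question>
    <QuestionIdentifier>utterance</QuestionIdentifier>
    <QuestionContent><Text>How would you ask a voice assistant for tomorrow's weather?</Text></QuestionContent>
    <AnswerSpecification><FreeTextAnswer/></AnswerSpecification>
  </Question>
</QuestionForm>"""

hit = mturk.create_hit(
    Title="Collect example voice-assistant queries",
    Description="Type one natural sentence you would say to a voice assistant.",
    Reward="0.05",                       # paid contribution, in USD
    MaxAssignments=100,                  # number of distinct workers
    LifetimeInSeconds=24 * 60 * 60,      # how long the task stays listed
    AssignmentDurationInSeconds=5 * 60,  # time allotted per worker
    Question=question_xml,
)
print("HIT created:", hit["HIT"]["HITId"])
```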
In-product data sourcing (in-experience) and data network effects –
Using the product itself as a source of behavioural data is another common data-sourcing technique. This data is typically used to improve existing product features. User data is usually anonymized before being used for training and, for the most part, is never seen directly by product teams.
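As a rough illustration of that anonymization step, here is a minimal sketch assuming a keyed-hash pseudonymization scheme over user IDs. The event schema and salt handling are illustrative assumptions, not any particular product's pipeline.

```python
# Minimal sketch: pseudonymizing behavioural event logs before training.
# The event fields and the secret-salt handling are illustrative assumptions.
import hashlib
import hmac
import os

# In practice the salt would live in a secrets manager and be rotated per policy.
SALT = os.environ.get("ANON_SALT", "change-me").encode()

def pseudonymize(user_id: str) -> str:
    """Replace a raw user ID with a stable, non-reversible keyed hash (HMAC-SHA256)."""
    return hmac.new(SALT, user_id.encode(), hashlib.sha256).hexdigest()

def anonymize_event(event: dict) -> dict:
    """Strip direct identifiers; keep only the fields the model needs."""
    return {
        "user": pseudonymize(event["user_id"]),  # stable pseudonymous ID
        "action": event["action"],
        "timestamp": event["timestamp"],
        # direct identifiers like email or device IDs are dropped entirely
    }

raw = {"user_id": "alice@example.com", "action": "click", "timestamp": 1700000000}
print(anonymize_event(raw))
```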
The product interaction data of the first few users is used to make the model better, which in turn attracts more users, creating a virtuous cycle. This is essentially a data network effect. Google Search is one of the best examples of this principle: the search algorithm improves through this virtuous cycle of data and users' interaction with search results. Tesla Autopilot is another great example; every driver input can, in theory, be treated as an error signal, whether Autopilot is engaged or not. This popular technique comes with a caveat: if a product feature asks for user input, users in turn expect the feature to get personalized over time. If that doesn't happen, the result is either dissatisfaction or noisy data.
Human data collection –
This is, to a certain extent, a specialized case of crowdsourcing. When bootstrapping a model for which human behavioural data is non-existent, a specialized group can be tasked with performing certain actions to produce the data. A couple of examples to think about: a) training a model for a robot lifting a box in a constrained space, and b) training a model for an autonomous car to detect lanes. This data does not exist yet, since machines never previously needed to be trained for these tasks.
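A hedged sketch of how such purpose-collected demonstrations are often stored, assuming a simple JSON-lines episode log; the field names are illustrative, not tied to any specific robotics stack.

```python
# Minimal sketch: logging human demonstrations to bootstrap a model.
# Episode and field names are illustrative assumptions.
import json
import time

def log_demonstration(path: str, episode_id: int, steps: list[dict]) -> None:
    """Append one human-performed episode (observations + actions) as JSON lines."""
    with open(path, "a") as f:
        for t, step in enumerate(steps):
            f.write(json.dumps({
                "episode": episode_id,
                "step": t,
                "timestamp": time.time(),
                "observation": step["observation"],  # e.g. joint angles, sensor readings
                "action": step["action"],            # the human operator's command
            }) + "\n")

# One toy step of a box-lifting demonstration performed by a human operator.
log_demonstration("demos.jsonl", episode_id=0, steps=[
    {"observation": {"gripper_z": 0.10, "box_z": 0.00}, "action": {"gripper_dz": 0.02}},
])
```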
Extracting data from public sources –
The internet is loaded with textual data. Some of this data can be harvested to learn and represent information that is searchable and helps answer user queries. The following two primary sources come to mind –
- Published datasets – Google Dataset Search and Kaggle are good avenues for finding datasets published by governments, academia, and tech companies. Research-heavy tech companies open-source datasets and models along with their research publications; Google Research, IBM, and Amazon are a few examples.
- Scraping data from public sources – e.g., Reddit is a great source of dialog and topical conversation data (a minimal scraping sketch follows this list).
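Here is that sketch, pulling conversation threads from Reddit's public JSON endpoint. The subreddit and extracted fields are assumptions for illustration; a real pipeline should respect Reddit's API terms and rate limits, and the official API (for example via a library such as PRAW) is the more robust route.

```python
# Minimal sketch: harvesting topical conversation data from Reddit's public
# JSON endpoint. Subreddit and extracted fields are illustrative assumptions;
# production use should go through the official API and honour rate limits.
import requests

def fetch_threads(subreddit: str, limit: int = 10) -> list[dict]:
    url = f"https://www.reddit.com/r/{subreddit}/top.json"
    resp = requests.get(
        url,
        params={"limit": limit, "t": "week"},
        headers={"User-Agent": "dataset-bootstrap-example/0.1"},  # Reddit rejects blank UAs
        timeout=10,
    )
    resp.raise_for_status()
    posts = resp.json()["data"]["children"]
    # Keep only fields useful for a dialog / topical-conversation dataset.
    return [
        {
            "title": p["data"]["title"],
            "text": p["data"].get("selftext", ""),
            "num_comments": p["data"]["num_comments"],
        }
        for p in posts
    ]

for thread in fetch_threads("AskReddit", limit=5):
    print(thread["title"], "-", thread["num_comments"], "comments")
```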
Purchasing / licensing data –
Licensing is a primary way of sourcing non-public data. With the advent of machine learning, many businesses have contributed to a growing data marketplace. Historical Twitter data licensed for academic research and aerial imagery from specialized data vendors are a few examples.
A word about data collection and labeling infrastructure –
Data management has grown into a new double-digit-billion-dollar industry. With the race to bring intelligence into consumer products, labeling infrastructure and annotation platforms are table stakes, and supervised learning is here to stay for a few more years. Big companies have invested in building their own infrastructure, while SMBs use available SaaS applications. Curious folks in this space should stay tuned.