Historically, the rate at which academic advances in the field have been commercialized has been very low. However, new advances in the emerging field of MLOps aim to accelerate this process and make it smoother and simpler.
To understand what makes commercializing computer vision algorithms such a challenging task, we must explore the nature of the field itself. Machine learning and computer vision can trace their origins to 1763, when mathematician Thomas Bayes developed a method of inference that used math to update the probability of a hypothesis as new information became available. Bayesian inference would serve as the basis for future machine learning innovations. The push for algorithms to be adaptable has driven what many are calling the new age of software, leading to new innovations such as Tesla Full Self-Driving and augmenting existing ones such as Google Search.
Since then, we have seen a number of new topics emerge in the field of machine learning: natural language processing can extract meaning from text and speech, sentiment analysis can extract feelings from different media, convolutional neural networks can detect and classify objects in a set of images, and many more.
Computer vision sounds cool, but what is it?
Computer vision is the process of using machine learning to analyze visual content, which can include images, icons, videos, and any other medium that involves pixels or low-level blocks of data. A computer vision algorithm extracts information from those pixels using statistical methods that build on the foundational work of the aforementioned Bayes.
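To make "pixels as data" concrete, here is a minimal sketch of what an image looks like to a program. It assumes Pillow and NumPy are installed, and "photo.jpg" is a hypothetical file standing in for any image:

```python
# A minimal sketch of visual content as raw pixel data.
# Assumes Pillow and NumPy are installed; "photo.jpg" is a hypothetical file.
import numpy as np
from PIL import Image

image = Image.open("photo.jpg").convert("RGB")
pixels = np.asarray(image)        # shape: (height, width, 3)

print(pixels.shape)               # e.g., (480, 640, 3)
print(pixels[0, 0])               # the top-left pixel's [R, G, B] values

# Even a simple statistic over raw pixels carries information:
# the mean brightness of each color channel.
print(pixels.mean(axis=(0, 1)))   # average [R, G, B] across the image
```

Everything a computer vision model does, from classifying hot dogs to reading logos, ultimately starts from an array like this.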
There are two main subcategories of computer vision: object classification and object identification. In classification, a model classifies new objects as belonging to categories identified in a training dataset, e.g., a hot dog or not a hot dog. In identification, a specific instance of an object is recognized; more precisely, the coordinates of the object are located within a given image, e.g., once a bag of potato chips is classified as Lays, detecting the coordinates associated with the word “Lays” in the image. These two subcategories can be extended from a single image to a sequence of related images (such as in a video), where object classification or identification happens repeatedly across all frames.
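The contrast is easy to see in code. The sketch below uses pretrained torchvision models (assuming a recent torchvision with the weights API) and a random tensor standing in for a real photo; in practice you would load and preprocess an actual image:

```python
# A minimal sketch contrasting classification and identification (detection).
import torch
from torchvision import models
from torchvision.models.detection import fasterrcnn_resnet50_fpn

image = torch.rand(3, 480, 640)  # a fake RGB image, values in [0, 1]

# Object classification: one label for the whole image.
classifier = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
classifier.eval()
with torch.no_grad():
    logits = classifier(image.unsqueeze(0))  # add a batch dimension
class_id = logits.argmax(dim=1).item()       # index of the predicted category

# Object identification (detection): labels plus coordinates.
detector = fasterrcnn_resnet50_fpn(weights="DEFAULT")
detector.eval()
with torch.no_grad():
    detections = detector([image])[0]
# detections["boxes"]  holds [x1, y1, x2, y2] coordinates per object,
# detections["labels"] the category of each box,
# detections["scores"] the model's confidence in each.
```

Note how the classifier returns a single answer for the whole image, while the detector returns a list of boxes, which is exactly the "coordinates within a given image" distinction described above.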
A natural extension of this explanation is that the more data you supply a computer vision algorithm, the better it will perform. For example, if one were writing an algorithm to detect actor Matt Damon’s face, a large amount of data covering different camera angles and lighting conditions would need to be supplied for the algorithm to reliably identify Matt Damon in every possible condition.
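One common way to broaden that coverage without photographing Matt Damon thousands more times is data augmentation: synthetically varying lighting, tilt, and framing on the images you already have. A hedged sketch using torchvision.transforms follows; "damon.jpg" is a hypothetical training image:

```python
# A sketch of data augmentation to simulate varied viewing conditions.
# Assumes Pillow and torchvision are installed; "damon.jpg" is hypothetical.
from PIL import Image
from torchvision import transforms

augment = transforms.Compose([
    transforms.ColorJitter(brightness=0.4, contrast=0.4, saturation=0.3),  # lighting changes
    transforms.RandomRotation(degrees=15),                                 # camera tilt
    transforms.RandomHorizontalFlip(),                                     # mirrored poses
    transforms.RandomResizedCrop(224, scale=(0.7, 1.0)),                   # distance/framing
])

original = Image.open("damon.jpg").convert("RGB")
# Each call yields a different simulated viewing condition of the same face.
variants = [augment(original) for _ in range(8)]
```

Each pass through the pipeline produces a slightly different version of the same photo, which is a cheap proxy for the camera-angle and lighting variety the paragraph above calls for.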