Machine Learning with Unstructured Data
Introduction
Machine learning can be used to process and analyze large sets of unstructured data, such as files that previously required manual processing. This could involve extracting objects from images or text, or building models from unstructured data.
Unstructured data is information that is not arranged according to a preset data model or schema. It can be textual or non-textual and includes sensor data, text files, and audio and video files. Structured data, by contrast, consists of numbers and values organized in a predefined format, such as the rows and columns of a database table.
Images
Deep learning can be used to extract objects from images. One approach uses a computer vision technique called “selective search” to propose candidate regions, or bounding boxes, of potential objects in the image, which a neural network can then classify. Deep learning workflows for feature extraction can be performed directly in ArcGIS Pro, or processing can be distributed using ArcGIS Image Server. Additionally, deep learning systems can be built to predict a set of non-visual attributes from images.
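To make the region-proposal idea more concrete, here is a toy sketch of the hierarchical grouping behind selective search. It is heavily simplified and the function names are my own: it starts from single pixels of a tiny grayscale grid and merges neighbors by mean intensity only, whereas the full method starts from an initial over-segmentation and uses richer similarity measures (color, texture, size, and fill). Every region formed along the way contributes one candidate bounding box.

```python
from itertools import combinations

def bounding_box(pixels):
    """Return (min_row, min_col, max_row, max_col) for a set of (row, col) pixels."""
    rows = [r for r, _ in pixels]
    cols = [c for _, c in pixels]
    return (min(rows), min(cols), max(rows), max(cols))

def are_adjacent(a, b):
    """True if any pixel of region a touches a pixel of region b (4-connectivity)."""
    steps = ((1, 0), (-1, 0), (0, 1), (0, -1))
    return any((r + dr, c + dc) in b for r, c in a for dr, dc in steps)

def mean_intensity(image, pixels):
    return sum(image[r][c] for r, c in pixels) / len(pixels)

def propose_regions(image):
    """Greedily merge the most similar adjacent regions and collect every box seen."""
    # Toy starting point: every pixel is its own region.
    regions = [{(r, c)} for r in range(len(image)) for c in range(len(image[0]))]
    proposals = {bounding_box(p) for p in regions}
    while len(regions) > 1:
        best = None
        for i, j in combinations(range(len(regions)), 2):
            if not are_adjacent(regions[i], regions[j]):
                continue
            diff = abs(mean_intensity(image, regions[i]) - mean_intensity(image, regions[j]))
            if best is None or diff < best[0]:
                best = (diff, i, j)
        if best is None:  # no adjacent pair left to merge
            break
        _, i, j = best
        merged = regions[i] | regions[j]
        regions = [p for k, p in enumerate(regions) if k not in (i, j)] + [merged]
        proposals.add(bounding_box(merged))
    return sorted(proposals)

# A 4x4 image with a bright 2x2 object in the top-left corner.
image = [
    [9, 9, 0, 0],
    [9, 9, 0, 0],
    [0, 0, 0, 0],
    [0, 0, 0, 0],
]
proposals = propose_regions(image)
# The box around the bright object, (0, 0, 1, 1), is among the candidates.
```

In a real detector, each proposed box would then be cropped and passed to a classifier. The pairwise search here is quadratic per merge, which is fine for a toy grid but not for real images.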
Deep learning can also predict non-visual attributes by transforming non-image data into an image-like form and applying convolutional neural networks (CNNs), which are a class of deep learning neural networks.
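As a rough illustration of what "transforming non-image data into an image form" can mean, the sketch below scales a flat feature vector to the range 0 to 1 and lays it out as a square grid of "pixels". The function name and the row-by-row layout are my own assumptions for illustration; practical methods usually choose the pixel positions more carefully so that related features end up near each other.

```python
import math

def features_to_image(features, fill=0.0):
    """Min-max scale a flat feature vector and reshape it into a square 2D grid."""
    if not features:
        raise ValueError("features must not be empty")
    lo, hi = min(features), max(features)
    span = hi - lo
    # Scale every value into [0, 1]; a constant vector maps to all zeros.
    scaled = [(x - lo) / span if span else 0.0 for x in features]
    # Smallest square side that can hold all values.
    side = math.isqrt(len(scaled))
    if side * side < len(scaled):
        side += 1
    # Pad the tail so the grid is complete, then cut it into rows.
    padded = scaled + [fill] * (side * side - len(scaled))
    return [padded[r * side:(r + 1) * side] for r in range(side)]

# Five hypothetical tabular features become a 3x3 single-channel "image".
grid = features_to_image([2.0, 4.0, 6.0, 10.0, 2.0])
# grid == [[0.0, 0.25, 0.5], [1.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
```

A grid like this could then be fed to a CNN in the same way as a small grayscale image.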
Text
Deep learning can also be used to extract text, for example by recognizing the text in scanned documents or photos. This is typically done using a combination of CNNs, recurrent neural networks (RNNs), and connectionist temporal classification (CTC) loss. Pytesseract is a popular Python wrapper for the Tesseract OCR engine, whose recognizer (since version 4) is based on LSTMs, a type of RNN.
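CTC is easier to picture from the decoding side. A recognizer trained with CTC loss outputs one score vector per time step, including a special "blank" symbol, and the simplest (greedy) decoding picks the best symbol per step, collapses repeats, and drops blanks. The sketch below shows only that decoding step, with made-up scores and a hypothetical four-symbol alphabet; it is not how any particular library implements it.

```python
def ctc_greedy_decode(frame_scores, alphabet, blank=0):
    """Greedy CTC decoding: best symbol per frame, collapse repeats, drop blanks."""
    # Index of the highest-scoring symbol at each time step.
    best_path = [max(range(len(scores)), key=scores.__getitem__)
                 for scores in frame_scores]
    decoded = []
    previous = None
    for index in best_path:
        # Keep a symbol only when it differs from the previous frame and is not blank.
        if index != previous and index != blank:
            decoded.append(alphabet[index])
        previous = index
    return "".join(decoded)

# Hypothetical alphabet; index 0 is the CTC blank.
alphabet = ["-", "c", "a", "t"]
frame_scores = [
    [0.10, 0.70, 0.10, 0.10],  # c
    [0.10, 0.60, 0.20, 0.10],  # c (repeat, collapsed)
    [0.80, 0.10, 0.05, 0.05],  # blank
    [0.10, 0.10, 0.70, 0.10],  # a
    [0.20, 0.10, 0.60, 0.10],  # a (repeat, collapsed)
    [0.10, 0.10, 0.10, 0.70],  # t
]
text = ctc_greedy_decode(frame_scores, alphabet)
# text == "cat"
```

The blank is what lets the model output genuine double letters: "a, blank, a" decodes to "aa", while "a, a" collapses to a single "a".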
Deep learning can also be used to process other types of unstructured data, such as video and audio. Machine learning models can also be built from unstructured data, which is often estimated to make up more than 80% of all data. Combining structured and unstructured data through deep learning techniques may help improve the performance of predictive models.
Using both structured and unstructured data
Combining structured and unstructured data can help improve the performance of predictive models and reduce errors. Big data analysis often requires combining the two to gain insight across separate data stores. For example, fusion models can learn better patient representations by combining structured and unstructured data, integrating heterogeneous data types across electronic health records.
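The simplest way to combine the two kinds of data is to turn the unstructured part into numbers and concatenate it with the structured part, so a single model sees both. The sketch below does this with word counts over a small vocabulary. The record, the vocabulary, and the function names are all hypothetical, and real fusion models typically use learned text representations rather than raw counts.

```python
def bag_of_words(text, vocabulary):
    """Count how often each vocabulary word appears in the text."""
    tokens = text.lower().split()
    return [float(tokens.count(word)) for word in vocabulary]

def fuse_features(structured, text, vocabulary):
    """Concatenate structured values with text-derived counts into one vector."""
    return list(structured) + bag_of_words(text, vocabulary)

# Hypothetical patient record: two structured fields plus a free-text note.
structured = [63.0, 1.0]  # e.g. age and a yes/no flag (illustrative only)
note = "Patient reports chest pain and mild chest tightness"
vocabulary = ["chest", "pain", "fever"]

fused = fuse_features(structured, note, vocabulary)
# fused == [63.0, 1.0, 2.0, 1.0, 0.0]
```

The fused vector can be passed to any standard model. In practice the structured fields would usually be scaled first so that large raw values such as age do not dominate the word counts.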