Boosting ML Model Interoperability and Efficiency with the ONNX framework

The rapid growth of artificial intelligence and machine learning has led to the development of numerous deep learning frameworks. Each framework has its strengths and weaknesses, making it challenging to deploy models across different platforms. However, the Open Neural Network Exchange (ONNX) framework has emerged as a powerful solution to this problem. This article introduces the ONNX framework, explains its basics, and highlights the benefits of using it.

Understanding the basics of ONNX

What is ONNX? The Open Neural Network Exchange (ONNX) is an open-source framework that enables the seamless interchange of models between different deep learning frameworks. It provides a standardized format for representing trained models, allowing them to be transferred and executed on various platforms. ONNX allows you to train your models using one framework and then deploy them using a different framework, eliminating the need for time-consuming and error-prone model conversions.
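To make the "train in one framework, deploy in another" idea concrete, here is a minimal sketch of exporting a trained PyTorch model to ONNX. It assumes a torchvision ResNet-18 as a stand-in for your own model; the file name, input shape, and axis names are illustrative.

```python
# Minimal sketch: export a PyTorch model to ONNX (assumes torchvision is installed).
import torch
import torchvision

model = torchvision.models.resnet18(weights=None)  # stand-in for your trained model
model.eval()

# Trace the model with a dummy input and serialize the graph to ONNX.
dummy_input = torch.randn(1, 3, 224, 224)
torch.onnx.export(
    model,
    dummy_input,
    "resnet18.onnx",
    input_names=["input"],
    output_names=["output"],
    dynamic_axes={"input": {0: "batch"}},  # allow a variable batch dimension
)
```

The exported resnet18.onnx file can then be loaded by any ONNX-compatible runtime or converted for another framework.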

ONNX framework interoperability

Why use ONNX? There are several significant benefits of using the ONNX framework. First and foremost, it enhances model interoperability. By providing a standardized model format, ONNX enables seamless integration between different deep learning frameworks, such as PyTorch, TensorFlow, Keras, and Caffe. This interoperability allows researchers and developers to leverage the strengths of multiple frameworks and choose the one that best suits their specific needs.

Advantages of using the ONNX framework

ONNX support and capabilities across platforms: One of the major advantages of the ONNX framework is its wide support and capabilities across platforms. ONNX models can be deployed on a variety of devices and platforms, including CPUs, GPUs, and edge devices. This flexibility allows you to leverage the power of deep learning across a range of hardware, from high-performance servers to resource-constrained edge devices.

Simplified deployment: ONNX simplifies the deployment process by eliminating the need for model conversion. With ONNX, you can train your models in your preferred deep learning framework and then export them directly to ONNX format. This saves time and reduces the risk of introducing errors during the conversion process.

Efficient execution: The framework provides optimized runtimes that ensure fast and efficient inference across different platforms. This means that your models can deliver high-performance results, even on devices with limited computational resources. By using ONNX, you can maximize the efficiency of your deep learning models without compromising accuracy or speed.

Enhancing model interoperability with ONNX

ONNX goes beyond just enabling model interoperability. It also provides a rich ecosystem of tools and libraries that further enhance the interoperability between different deep learning frameworks. For example, ONNX Runtime is a high-performance inference engine that allows you to seamlessly execute ONNX models on a wide range of platforms. It provides support for a variety of hardware accelerators, such as GPUs and FPGAs, enabling you to unlock the full potential of your models.

ONNX Runtime
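As a quick illustration, running an exported model with ONNX Runtime takes only a few lines; this sketch assumes the resnet18.onnx file from the export example above, and the input shape is illustrative.

```python
# Minimal sketch: run an ONNX model with ONNX Runtime.
import numpy as np
import onnxruntime as ort

# CPU works everywhere; on a GPU build, list "CUDAExecutionProvider" first.
session = ort.InferenceSession("resnet18.onnx", providers=["CPUExecutionProvider"])

input_name = session.get_inputs()[0].name
batch = np.random.rand(1, 3, 224, 224).astype(np.float32)  # stand-in input
outputs = session.run(None, {input_name: batch})
print(outputs[0].shape)  # e.g. (1, 1000) class scores for an ImageNet model
```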

Moreover, ONNX also supports model optimization and quantization techniques. These techniques can help reduce the size of your models, making them more efficient to deploy and run on resource-constrained devices. By leveraging the optimization and quantization capabilities of ONNX, you can ensure that your models are not only interoperable but also highly efficient.
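For instance, ONNX Runtime ships post-training dynamic quantization utilities; the sketch below assumes the exported model from earlier, with illustrative file names.

```python
# Minimal sketch: post-training dynamic quantization with ONNX Runtime.
from onnxruntime.quantization import quantize_dynamic, QuantType

quantize_dynamic(
    model_input="resnet18.onnx",        # original float32 model
    model_output="resnet18.int8.onnx",  # quantized output model
    weight_type=QuantType.QInt8,        # store weights as 8-bit integers
)
```

The quantized file is typically around a quarter of the original size and runs through the same InferenceSession API.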

Improving efficiency with the ONNX framework

Efficiency is a critical factor in deep learning, especially when dealing with large-scale models and vast amounts of data. The ONNX framework offers several features that can help improve the efficiency of models and streamline the development process.

One such feature is the ONNX Model Zoo, which provides a collection of pre-trained models that anyone can use as a starting point for projects. These models cover a wide range of domains and tasks, including image classification, object detection, and natural language processing. By leveraging pre-trained models from the ONNX Model Zoo, you save time and computational resources and can focus on fine-tuning the models for your specific needs.
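A quick way to sanity-check a downloaded Model Zoo file is to load and validate it with the onnx package; the file name below is hypothetical, so substitute whichever zoo model you fetch.

```python
# Minimal sketch: load and validate a pre-trained ONNX Model Zoo file.
import onnx

model = onnx.load("resnet50-v1-12.onnx")   # hypothetical zoo download
onnx.checker.check_model(model)            # raises if the graph is malformed
print(onnx.helper.printable_graph(model.graph)[:500])  # peek at the first ops
```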

Another efficiency-enhancing feature of ONNX is its support for model compression techniques. Model compression aims to reduce the size of deep learning models without significant loss in performance. ONNX provides tools and libraries that enable you to apply compression techniques, such as pruning, quantization, and knowledge distillation, to your models. By compressing the models with ONNX, you can achieve smaller model sizes, faster inference times, and reduced memory requirements.

Successful implementations of ONNX

To understand the real-world impact of the ONNX framework, let’s look at some use cases where it has been successfully implemented.
Facebook AI Research used ONNX to improve the efficiency of their deep learning models for image recognition. By converting their models to the ONNX format, they were able to deploy them on a range of platforms, including mobile devices and web browsers. This improved the accessibility of their models and allowed them to reach a wider audience.

Microsoft utilized ONNX to optimize their machine learning models for speech recognition. By leveraging the ONNX Runtime, they achieved faster and more efficient inference on various platforms, enabling real-time speech-to-text transcription in their applications.
These use cases demonstrate the versatility and effectiveness of the ONNX framework in real-world scenarios, highlighting its ability to enhance model interoperability and efficiency.

Challenges and limitations of the ONNX framework

While the ONNX framework offers numerous benefits, it also has its challenges and limitations. One of the main challenges is the discrepancy in supported operators and layers across different deep learning frameworks. Although ONNX aims to provide a comprehensive set of operators, there may still be cases where certain operators are not fully supported or behave differently across frameworks. This can lead to compatibility issues when transferring models between frameworks.

Another limitation of the ONNX framework is the lack of support for dynamic neural networks. ONNX primarily focuses on static computational graphs, which means that models with dynamic structures, such as Recurrent Neural Networks (RNNs) or models with varying input sizes, may not be fully supported.

It is important to carefully consider these challenges and limitations when deciding to adopt the ONNX framework for deep learning projects. However, it is worth noting that the ONNX community is actively working towards addressing these issues and improving the framework’s capabilities.

Future trends and developments in ONNX

The ONNX framework is continuously evolving, with ongoing developments and future trends that promise to further enhance its capabilities. One such development is the integration of ONNX with other emerging technologies, such as federated learning and edge computing. This integration will enable efficient and privacy-preserving model exchange and execution in distributed environments.

Furthermore, the ONNX community is actively working on expanding the set of supported operators and layers, as well as improving the compatibility between different deep learning frameworks. These efforts will further enhance the interoperability and ease of use of the ONNX framework.

To summarize, the ONNX framework provides a powerful solution to the challenges of model interoperability and efficiency in deep learning. By offering a standardized format for representing models and a rich ecosystem of tools and libraries, ONNX enables seamless integration between different deep learning frameworks and platforms. Its support for model optimization and quantization techniques further enhances the efficiency of deep learning models.

While the ONNX framework has its challenges and limitations, its continuous development and future trends promise to address these issues and expand its capabilities. With the increasing adoption of ONNX in both research and industry, this framework is playing a crucial role in advancing the field of deep learning.

For those seeking to enhance the interoperability and efficiency of their deep learning models, exploring the ONNX framework is highly advisable. With its wide support, powerful capabilities, and vibrant community, ONNX is poised to revolutionize the development and deployment of deep learning models for organizations.

At Softnautics, a MosChip company, our team of AI/ML experts is dedicated to developing optimized Machine Learning solutions specifically tailored for a diverse array of edge platforms. Our expertise covers FPGAs, ASICs, CPUs, GPUs, TPUs, and neural network compilers, ensuring the implementation of efficient and high-performance machine learning solutions based on cognitive computing, computer vision, deep learning, Natural Language Processing (NLP), vision analytics, and more.

Read our success stories related to Artificial Intelligence and Machine Learning services to learn more about our AI/ML expertise.

Contact us at business@softnautics.com for any queries related to your solution design or for consultancy.



Multimedia Intelligence: Confluence of Multimedia and Artificial Intelligence

In contrast to traditional mass media, such as printed material or audio recordings, which involve little to no interaction between users, multimedia is a form of communication that combines different content forms such as audio, text, animations, images, or video into a single interactive presentation. This definition now seems outdated: by 2022, multimedia had exploded into far more complex forms of interaction. Alexa, Google Assistant, Twitter, Snapchat, Instagram Reels, and many more such apps are becoming part of everyday life. Such an explosion of multimedia and the rising need for artificial intelligence are bound to collide, and that is where multimedia intelligence comes into the picture. The multimedia market is being driven forward by the increasing popularity of virtual creation in the media and entertainment industries, as well as its ability to create high-definition graphics and real-time virtual worlds. The growth is such that between 2022 and 2030, the global market for AI in media & entertainment is anticipated to expand at a 26.9% CAGR and reach about USD 99.48 billion, as per Grand View Research, Inc. reports.

What is multimedia intelligence?

The rise and consumption of ever-emerging multimedia applications and services are churning out vast amounts of data, opening new avenues for research and analysis. Strong lines of multimedia research already exist, such as image/video content analysis, video and image search, recommendations, and multimedia streaming. At the same time, artificial intelligence is evolving at a rapid pace, making this the perfect time to tap content-rich multimedia for more intelligent applications.
Multimedia intelligence refers to the ecosystem created when we apply artificial intelligence to multimedia data. This ecosystem is a two-way, give-and-take relationship. In one direction, multimedia boosts research in artificial intelligence, enabling the evolution of algorithms and pushing AI toward human-level perception and understanding. In the other, artificial intelligence makes multimedia data more inferable and reliable by providing its ability to reason. For example, on-demand video streaming applications use AI algorithms to analyse user demographics and behaviour and recommend content that users enjoy watching. These AI-powered platforms can thus serve users content tailored to their specific interests, resulting in a truly customized experience. Multimedia intelligence is therefore a closed cyclic loop between multimedia and AI, in which the two mutually influence and enhance each other.

Evolution and significance
The evolution of multimedia should be credited to the evolution of smartphones. Video calling through applications like Skype and WhatsApp made it clear that multimedia was here to dominate, a significant step because it completely revolutionized long-distance communication. Multimedia has since evolved into even more complex applications, such as streaming platforms like Discord and Twitch. AR/VR technology then took it a step further by integrating motion sensing and geo-sensing with audio and video.
Multimedia combines multimodal and heterogeneous data: images, audio, video, text, and more, together. Multimedia data has become very complex, and this complexity will only grow. Conventional algorithms are not capable of correlating and deriving insights from such data, and this remains an active area of research; even for AI algorithms, connecting and establishing relationships between the different modalities of the data is a challenge.

Difference between media intelligence and multimedia intelligence
There is a significant difference between media and multimedia intelligence. Text, drawings, visuals, pictures, film, video, wireless, audio, motion graphics, the web, and so on are all examples of media. Simply put, multimedia is the combination of two or more types of media to convey information. When we talk about media intelligence, we already see applications that exhibit it: voice bots like Alexa and Google Assistant are audio-intelligent, chatbots are text-intelligent, and drones that recognize and follow hand gestures are video-intelligent. Truly multimedia-intelligent applications are still rare; one example is EMO, an AI desktop robot that uses multimedia for all its interactions.

Industrial landscape for multimedia intelligence
Multimedia is closely tied to the media and entertainment industry. Artificial Intelligence enhances and influences everything in multimedia.

Landscape for Multimedia Intelligence

Let’s walk through each stage and see how artificial intelligence is impacting them:

Media devices
The media devices most closely intertwined with artificial intelligence applications are cameras and microphones. Smart cameras are no longer limited to capturing images and videos; they increasingly detect objects, track items, apply face filters, and more, all driven by AI algorithms that ship as part of the camera itself. Microphones are also getting smarter, with AI algorithms performing active noise cancellation and filtering out ambient sounds. Wake words are the new norm: thanks to applications like Alexa and Siri, next-gen microphones come with built-in wake-word or key-phrase recognition AI models.

Image/Audio coding and compression
Autoencoders consist of two components, an encoder and a decoder, and are self-supervised machine learning models that learn to recreate input data at a reduced size. These models are trained like supervised machine learning models and inferred like unsupervised models, hence the name self-supervised. Autoencoders can be used for image denoising, image compression, and, in some cases, even the generation of image data. This is not limited to images; autoencoders can be applied to audio data for the same purposes.
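As a minimal sketch of the encoder/decoder idea, the Keras model below compresses flattened 28x28 grayscale images into a 32-value bottleneck and reconstructs them; all sizes are illustrative.

```python
# Minimal sketch: a dense autoencoder for image compression/denoising.
import tensorflow as tf
from tensorflow.keras import layers, models

encoder = models.Sequential([
    layers.Input(shape=(784,)),           # flattened 28x28 image
    layers.Dense(128, activation="relu"),
    layers.Dense(32, activation="relu"),  # bottleneck: 784 -> 32 values
])
decoder = models.Sequential([
    layers.Input(shape=(32,)),
    layers.Dense(128, activation="relu"),
    layers.Dense(784, activation="sigmoid"),  # reconstruct the input
])
autoencoder = models.Sequential([encoder, decoder])

# Self-supervised: the input itself is the training target.
autoencoder.compile(optimizer="adam", loss="mse")
# autoencoder.fit(x_train, x_train, epochs=10, batch_size=256)
```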
GANs (Generative Adversarial Networks) are another revolutionary class of deep neural networks that have made it possible to generate images from text. OpenAI's DALL·E project can generate images from textual descriptions, and GFP-GAN (Generative Facial Prior GAN) is another project that can correct and re-create degraded images. AI has shown quite promising results here and has proven the feasibility of deep learning-based image/audio encoding and compression.

Audio / Video distribution
Video streaming platforms like Netflix and Disney+ Hotstar use AI extensively to improve content delivery to a global user base. AI algorithms dominate the personalization and recommendation services of both platforms and are also used to generate video metadata that improves search. Predicting content demand and caching the appropriate video content geographically is a challenging task that AI algorithms have simplified to a good extent. AI has proven its potential to be a game-changer for the streaming industry by offering effective ways to encode, distribute, and organize data. Beyond video streaming, AI will become an integral part of AV distribution for game streaming platforms like Discord and Twitch and communication platforms like Zoom and Webex.

Categorization of content
On the internet, data is created in a wide range of formats every few seconds. Categorizing and organizing all of it is a huge task. Artificial intelligence steps in to help classify information into relevant categories, enabling users to find their preferred topics faster, improving customer engagement, creating more enticing and effective targeted content, and boosting revenue.

Regulating and identifying fake content
Several websites generate and spread fake news alongside legitimate news stories to enrage the public about events or societal issues. AI assists with discovering and managing such content, and with moderating or deleting it before distribution on internet platforms like social media sites. Platforms such as Facebook, LinkedIn, Twitter, and Instagram employ powerful AI algorithms in most of their features: targeted advertising, recommendations, job suggestions, fraudulent-profile detection, harmful-content detection, and more all have AI in them.

We have tried to cover how multimedia and artificial intelligence are interrelated and how they are impacting various industries. Still, this is a broad research topic: media intelligence is in its early stages, with AI algorithms learning from a single medium and separate algorithms built to correlate them. There is still scope for AI algorithms that understand full multimedia data as a unified whole, the way a human does.

Softnautics has a long history of creating and integrating embedded multimedia and ML software stacks for various global clients. Our multimedia specialists have experience dealing with multimedia devices, smart camera applications, VoD & media streaming, multimedia frameworks, media infotainment systems, and immersive solutions. We work with media firms and domain chipset manufacturers to create multimedia solutions that integrate digital information with physical reality in innovative and creative ways across a wide range of platforms.

Read our success stories related to Machine Learning services around multimedia to know more about our expertise.

Contact us at business@softnautics.com for any queries related to your solution or for consultancy.



Model Compression Techniques for Edge AI

Deep learning is growing at a tremendous pace in terms of models and their datasets. In terms of applications, the deep learning market is dominated by image recognition, followed by optical character recognition and facial and object recognition. According to Allied Market Research, the global deep learning market was valued at $6.85 billion in 2020 and is predicted to reach $179.96 billion by 2030, with a CAGR of 39.2% from 2021 to 2030. At one point it was believed that larger, more complex models necessarily perform better, but that is now largely a myth. With the evolution of Edge AI, more and more techniques have emerged for converting a large, complex model into a simpler model that can run on the edge; collectively, these techniques perform model compression.

What is Model Compression?

Model compression is the process of deploying SOTA (state-of-the-art) deep learning models on edge devices that have low computing power and memory, without compromising the model's performance in terms of accuracy, precision, recall, etc. Model compression broadly reduces two things in a model: size and latency. Size reduction focuses on making the model simpler by reducing its parameters, thereby lowering RAM requirements during execution and storage requirements in memory. Latency reduction refers to decreasing the time a model takes to make a prediction or infer a result. Model size and latency often go together, and most techniques reduce both.

Popular Model Compression Techniques

Pruning
Pruning is the most popular technique for model compression; it works by removing redundant and inconsequential parameters. These parameters can be connectors, neurons, channels, or even entire layers of a neural network. Pruning is popular because it simultaneously decreases a model's size and improves latency.

Pruning

Pruning can be done during training or post-training. Common pruning techniques include weight/connection pruning, neuron pruning, filter pruning, and layer pruning.
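As an illustration, the TensorFlow Model Optimization Toolkit exposes magnitude-based weight pruning for Keras models; the sketch below assumes `model` is an existing compiled classifier, and the sparsity schedule values are illustrative.

```python
# Minimal sketch: magnitude-based weight pruning with tensorflow_model_optimization.
import tensorflow_model_optimization as tfmot

# Ramp the fraction of zeroed weights from 30% to 80% over 1000 steps.
pruning_schedule = tfmot.sparsity.keras.PolynomialDecay(
    initial_sparsity=0.30, final_sparsity=0.80, begin_step=0, end_step=1000)

pruned_model = tfmot.sparsity.keras.prune_low_magnitude(
    model, pruning_schedule=pruning_schedule)
pruned_model.compile(optimizer="adam",
                     loss="sparse_categorical_crossentropy",
                     metrics=["accuracy"])

# UpdatePruningStep must run every batch to advance the schedule.
callbacks = [tfmot.sparsity.keras.UpdatePruningStep()]
# pruned_model.fit(x_train, y_train, epochs=2, callbacks=callbacks)

# strip_pruning removes the bookkeeping wrappers before export.
final_model = tfmot.sparsity.keras.strip_pruning(pruned_model)
```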

Quantization
Whereas pruning removes neurons, connections, filters, or layers to lower the number of weight parameters, quantization decreases the size of the weights themselves. Values from a large set are mapped to values in a smaller set in this process. In comparison to the input network, the output network has a narrower range of values but retains most of the information. For further details on this method, you may read our in-depth article on model quantization.

Knowledge Distillation
In the knowledge distillation process, we train a large, complex model on a very large dataset. After fine-tuning, this large model works well on unseen data. Once that is achieved, its knowledge is transferred to a smaller neural network: both a teacher network (the larger model) and a student network (the smaller model) are used. Knowledge distillation differs from transfer learning: in knowledge distillation we don't tweak the teacher model, whereas in transfer learning we reuse the exact model and weights, alter the model to some extent, and adapt it for the related task.

Knowledge distillation system

The knowledge, the distillation algorithm, and the teacher-student architecture are the three main parts of a typical knowledge distillation system, as shown in the diagram above.
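The core of the process is the loss the student minimizes. Below is a minimal sketch of a distillation loss, assuming `teacher_logits` and `student_logits` are the two networks' outputs for one batch; the temperature and alpha values are illustrative.

```python
# Minimal sketch: a distillation loss combining hard labels and soft targets.
import tensorflow as tf

def distillation_loss(labels, student_logits, teacher_logits,
                      temperature=4.0, alpha=0.1):
    # Hard loss: student vs. the ground-truth labels.
    hard = tf.keras.losses.sparse_categorical_crossentropy(
        labels, student_logits, from_logits=True)
    # Soft loss: student matches the teacher's temperature-softened outputs.
    teacher_soft = tf.nn.softmax(teacher_logits / temperature)
    student_soft = tf.nn.softmax(student_logits / temperature)
    soft = tf.keras.losses.kl_divergence(teacher_soft, student_soft)
    # The temperature**2 factor keeps the soft-loss gradients comparable in scale.
    return alpha * hard + (1.0 - alpha) * soft * temperature**2
```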

Low-Rank Matrix Factorization
Matrices form the bulk of most deep neural architectures. This technique aims to identify redundant parameters by applying matrix or tensor decomposition, replacing a large matrix with smaller ones. Applied to dense DNN (Deep Neural Network) layers, it decreases storage requirements, while factorization of CNN (Convolutional Neural Network) layers improves inference time. A two-dimensional weight matrix A of rank r can be decomposed into smaller matrices, as shown below.

Low-rank matrix factorization

Model accuracy and performance depend highly on proper factorization and rank selection. The main challenges in the low-rank factorization process are its harder implementation and its computational intensity. Overall, factorization of the dense layer matrices results in a smaller model and faster performance compared to a full-rank matrix representation.
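A minimal NumPy sketch of the idea, using a truncated SVD on a stand-in weight matrix; the dimensions and kept rank are illustrative.

```python
# Minimal sketch: low-rank factorization of a dense weight matrix via SVD.
import numpy as np

m, n, k = 512, 1024, 64                       # matrix size and kept rank
A = np.random.randn(m, n).astype(np.float32)  # stand-in dense-layer weights

U, s, Vt = np.linalg.svd(A, full_matrices=False)
U_k = U[:, :k] * s[:k]   # (m, k): singular values folded into U
V_k = Vt[:k, :]          # (k, n)

approx = U_k @ V_k       # two thin multiplies replace one dense multiply
print(A.size, "->", U_k.size + V_k.size, "stored values")  # 524288 -> 98304
```

Storage drops from m*n to k*(m+n) values, which is where the smaller model and faster inference come from when k is much smaller than the full rank.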

With the rise of Edge AI, model compression strategies have become incredibly important. These methods are complementary to one another and can be used across stages of the entire AI pipeline. Popular frameworks like TensorFlow and PyTorch now include techniques like pruning and quantization out of the box, and the number of techniques used in this area will only grow.

At Softnautics, we provide AI engineering and Machine Learning services with expertise in cloud platform accelerators like Azure and AMD, edge platforms (FPGA, TPU, controllers), NN compilers for the edge, and tools like Docker, GIT, AWS DeepLens, Jetpack SDK, TensorFlow, TensorFlow Lite, and many more, targeted at domains like Multimedia, Industrial IoT, Automotive, Healthcare, Consumer, and Security-Surveillance. We collaborate with organizations to develop high-performance cloud-to-edge machine learning solutions like face/gesture recognition, people counting, object/lane detection, weapon detection, food classification, and more across a variety of platforms.

Read our success stories related to Machine Learning expertise to know more about our services for accelerated AI solutions.

Contact us at business@softnautics.com for any queries related to your solution or for consultancy.



Developing TPU based AI solutions using TensorFlow Lite

AI has become ubiquitous today; from personal devices to enterprise applications, you see it everywhere. The advent of IoT, coupled with rising demands for data privacy, low power, low latency, and constrained bandwidth, has increasingly pushed AI models to run at the edge instead of in the cloud. According to Grand View Research, the global edge artificial intelligence chips market was valued at USD 1.8 billion in 2019 and is expected to grow at a CAGR of 21.3 percent from 2020 to 2027. Against this backdrop, Google introduced the Edge TPU, also known as the Coral TPU, its purpose-built ASIC for running AI at the edge. It is designed to deliver excellent performance while taking up minimal space and power. Training typically yields AI models with high storage requirements that demand GPU-class processing power, and we cannot execute them on devices with small memory and processing footprints. TensorFlow Lite is useful in this situation: it is an open-source deep learning framework that runs on the Edge TPU and allows for on-device inference and AI model execution. Note that TensorFlow Lite is only for executing inference on the edge, not for training a model; for training an AI model, we must use TensorFlow.

Combining Edge TPU and TensorFlow Lite

Not just any AI model can be deployed on the Edge TPU. The Edge TPU supports a specific set of NN (neural network) operations and architectures designed to enable high-speed neural network performance with low power consumption, and beyond those specific networks, it only runs TensorFlow Lite models that have been 8-bit quantized and compiled for the Edge TPU.

For a quick summary, TensorFlow Lite is a lightweight version of TensorFlow specially designed for mobile and embedded devices. It achieves low-latency results with a small storage size. The TensorFlow Lite converter turns a TensorFlow-based AI model file (.pb) into a TensorFlow Lite file (.tflite). Below is a sketch of a standard workflow for deploying applications on the Edge TPU.
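This is a hedged sketch of the conversion step with the full-integer quantization the Edge TPU requires; "saved_model_dir" and `calibration_batches` are placeholders for your own trained model and sample data.

```python
# Minimal sketch: convert a SavedModel to an 8-bit .tflite file for the Edge TPU.
import tensorflow as tf

def representative_data_gen():
    # Yield ~100 real float32 input samples so the converter can calibrate ranges.
    for batch in calibration_batches:  # assumed iterable of np.float32 arrays
        yield [batch]

converter = tf.lite.TFLiteConverter.from_saved_model("saved_model_dir")
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = representative_data_gen
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
converter.inference_input_type = tf.int8
converter.inference_output_type = tf.int8

tflite_model = converter.convert()
with open("model_quant.tflite", "wb") as f:
    f.write(tflite_model)
# Then compile the file for the accelerator:  edgetpu_compiler model_quant.tflite
```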

Let’s look at some interesting real-world applications that can be built using TensorFlow Lite on the Edge TPU.

Human Detection and Counting

This solution has many practical applications, especially in malls, retail, government offices, banks, and enterprises. One may wonder what can be done by detecting and counting humans, but data now carries the value of time and money. Let us see how the insights from human detection and counting can be used.

Estimating Footfalls

For the retail industry, this is important as it indicates whether stores are doing well and whether displays are attracting customers into the shops. It also helps retailers know whether to increase or decrease support staff. For other organizations, footfall estimates help in taking adequate security measures for people.

Crowd Analytics and Queue Management

For government offices and enterprises, queue management via human detection and counting helps manage long queues and save people's time. Studying queues can also shed light on individual and organizational performance. Crowd detection can help raise alerts for emergencies, security incidents, etc., so appropriate actions can be taken. Such solutions give the best results when deployed on the edge, as required actions can be taken in close to real time.

Age and Gender-based Targeted Advertisements

This solution mainly has practical applications in the retail and advertisement industry. Imagine walking towards an advertisement display showing a women's shoe ad, and suddenly the advertisement changes to a men's shoe ad because the system determined you are male. Targeted advertisements help retailers and manufacturers target their products better and create brand awareness that a busy person would otherwise never get to see.

This is not restricted to advertisements: age and gender detection can also help businesses make quick decisions, such as staffing retail stores appropriately and understanding which age groups and genders prefer visiting a store. All of this is more powerful and effective when you can determine and act quickly, which is all the more reason to run the solution on the Edge TPU.

Face Recognition

The very first face recognition system was built around 1970, and the technology is still being developed and made more robust and effective to this day. The main advantage of having face recognition on the edge is real-time recognition. Another is performing face encryption and feature extraction on the edge and sending only the encrypted, extracted data to the cloud for matching, thereby protecting the PII-level privacy of face images (since face images are stored neither on the edge nor in the cloud) and complying with stringent privacy laws.

The Edge TPU combined with the TensorFlow Lite framework opens up many edge AI application opportunities. As the framework is open source, the Open-Source Software (OSS) community supports it, making it even more popular for machine learning use cases. Overall, the TensorFlow Lite platform fosters the growth of edge applications for embedded and IoT devices.

At Softnautics, we provide AI engineering and machine learning services and solutions with expertise in edge platforms (TPU, RPi, FPGA), NN compilers for the edge, cloud platform accelerators like AWS, Azure, and AMD, and tools like TensorFlow, TensorFlow Lite, Docker, GIT, AWS DeepLens, Jetpack SDK, and many more, targeted at domains like Automotive, Multimedia, Industrial IoT, Healthcare, Consumer, and Security-Surveillance. Softnautics helps businesses build high-performance cloud- and edge-based ML solutions like object/lane detection, face/gesture recognition, human counting, key-phrase/voice command detection, and more across various platforms.

Read our success stories related to Machine Learning expertise to know more about our services for accelerated AI solutions.

Contact us at business@softnautics.com for any queries related to your solution or for consultancy.



Model Quantization for Edge AI

Deep learning is witnessing a growing history of success; however, the large, heavy models that must be run on high-performance computing systems are far from optimal. Artificial intelligence is already widely used in business applications, and the computational demands of AI inference and training are increasing. As a result, a relatively new class of deep learning approaches known as quantized neural network models has emerged to address this disparity. Memory has been one of the biggest challenges for deep learning architectures; it was the evolution of the gaming industry that led to the rapid development of hardware, culminating in the GPUs that enable the 50-layer networks of today. Still, the hunger for memory of newer and more powerful networks is pushing the evolution of deep learning model compression techniques to put a leash on this requirement, as AI quickly moves towards edge devices to deliver near real-time results on captured data. Model quantization is one such rapidly growing technology; it allows deep learning models to be deployed on edge devices with less power, memory, and computational capacity than a full-fledged computer.

How did AI Migrate from Cloud to Edge?

A computer examines visual data and searches for a specified set of indicators, such as the shape of a person's head, the depth of their eyelids, etc. A database of facial markers is built, and an image of a face that matches the database's essential threshold of resemblance suggests a possible match. Face recognition technologies, including machine vision, modelling and reconstruction, and analytics, require advanced algorithms from machine learning, deep learning, and CNNs (Convolutional Neural Networks), an area that is growing at an exponential rate.

Edge AI mostly works in a decentralized fashion. Small clusters of computing devices work together to drive decision-making rather than sending data to a large processing center. Edge computing boosts a device's real-time responsiveness significantly. Other advantages of edge AI over cloud AI are lower costs of operation, bandwidth, and connectivity. This is not as easy as it sounds, though: running AI models on edge devices while maintaining inference time and high throughput is equally challenging. Model quantization is the key to solving this problem.

The need for quantization

Before going into quantization, let's see why a neural network in general takes up so much memory.

Elements of ANN

As shown in the figure above, a standard artificial neural network consists of layers of interconnected neurons, each with its own weight, bias, and activation function. These weights and biases are referred to as the "parameters" of the network and are physically stored in memory. 32-bit floating point is the standard representation for them, allowing a high level of precision and accuracy for the neural network.

This precision makes any neural network take up a lot of memory. Imagine a neural network with millions of parameters and activations, each stored as a 32-bit value, and the memory it will consume. For example, a 50-layer ResNet architecture contains roughly 26 million weights and computes 16 million activations, so using 32-bit floating-point values for both the weights and activations would make the entire architecture consume around 168 MB of storage. Quantization is an umbrella term for techniques that convert input values from a large set to output values in a smaller set. The deep learning models we use for inference are essentially matrices subjected to complex, iterative mathematical operations, mostly multiplications. Converting those 32-bit floating-point values to 8-bit integers lowers the precision of the weights used.

Quantization Storage Format 

Thanks to this storage format, the memory footprint of the model shrinks and performance improves drastically. In deep learning, weights and biases are stored as 32-bit floating-point numbers. Once the model is trained, they can be reduced to 16-bit floating point (2x size reduction) or 8-bit integers (4x size reduction), which reduces the model size. This comes with a trade-off in the accuracy of the model's predictions; however, it has been shown empirically that in many situations a quantized model suffers no significant decay, or in some scenarios none at all.

Quantized Neural Network model 
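To see what the mapping actually does, here is a worked NumPy sketch of the affine scheme commonly used for 8-bit quantization: q = round(x / scale) + zero_point, recovered as x ≈ scale * (q - zero_point). The values are illustrative.

```python
# Minimal sketch: affine quantization of float32 values to 8-bit integers.
import numpy as np

x = np.random.randn(5).astype(np.float32)     # stand-in fp32 weights

scale = (x.max() - x.min()) / 255.0           # map the observed range onto 0..255
zero_point = int(np.round(-x.min() / scale))  # integer that represents 0.0

q = np.clip(np.round(x / scale) + zero_point, 0, 255).astype(np.uint8)
x_hat = scale * (q.astype(np.float32) - zero_point)

print(x)      # original values
print(x_hat)  # dequantized values: close, but with small quantization error
```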

How does the quantization process work?

There are two ways to do model quantization, as explained below:

Post-Training Quantization:

As the name suggests, post-training quantization is the process of converting a pre-trained model into a quantized model, i.e., converting the model parameters from 32-bit to 16-bit or 8-bit. It comes in two flavours: hybrid quantization, where you quantize only the weights and do not touch the other parameters of the model, and full quantization, where you quantize both the weights and the activations of the model.
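A hedged TensorFlow Lite sketch of both flavours; "saved_model_dir" and `calibration_batches` are placeholders for your own model and data.

```python
# Minimal sketch: hybrid vs. full post-training quantization in TensorFlow Lite.
import tensorflow as tf

# Hybrid: only the weights become 8-bit integers.
converter = tf.lite.TFLiteConverter.from_saved_model("saved_model_dir")
converter.optimizations = [tf.lite.Optimize.DEFAULT]
hybrid_model = converter.convert()

# Full: activations are quantized too, which needs calibration samples.
converter = tf.lite.TFLiteConverter.from_saved_model("saved_model_dir")
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = lambda: (
    [batch] for batch in calibration_batches  # assumed float32 sample batches
)
full_model = converter.convert()
```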

Quantization-Aware Training:

As the name suggests, here we quantize the model during training. The network is modified before initial training begins (by inserting dummy, or "fake", quantize nodes), and it learns 8-bit weights through training rather than being converted later.
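A minimal sketch with the TensorFlow Model Optimization Toolkit, which wraps an existing Keras `model` (assumed here) with fake-quantize nodes before training.

```python
# Minimal sketch: quantization-aware training with tensorflow_model_optimization.
import tensorflow_model_optimization as tfmot

# Insert fake-quantize nodes so training learns around 8-bit precision.
qat_model = tfmot.quantization.keras.quantize_model(model)
qat_model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
# qat_model.fit(x_train, y_train, epochs=3)
# Afterwards, convert with TFLiteConverter as in post-training quantization.
```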

Benefits and Drawbacks of Quantization

Quantized neural networks, in addition to improving performance, significantly improve power efficiency for two reasons: lower memory access costs and better computation efficiency. Lower-bit quantized data requires less data movement both on and off the chip, reducing memory bandwidth and conserving a great deal of energy.

As mentioned earlier, it has been shown empirically that quantized models often don't suffer significant decay; still, there are times when quantization greatly reduces a model's accuracy. With a good application of post-training quantization or quantization-aware training, one can overcome this drop in accuracy.

Model quantization is vital when it comes to developing and deploying AI models on edge devices that have low power, memory, and compute. It smoothly adds intelligence to the IoT ecosystem.

At Softnautics, we provide AI and Machine Learning services and solutions with expertise in cloud platform accelerators like Azure and AMD, edge platforms (TPU, RPi), NN compilers for the edge, and tools like Docker, GIT, AWS DeepLens, Jetpack SDK, TensorFlow, TensorFlow Lite, and many more, targeted at domains like Multimedia, Industrial IoT, Automotive, Healthcare, Consumer, and Security-Surveillance. We help businesses build high-performance cloud-to-edge Machine Learning solutions like face/gesture recognition, human counting, key-phrase/voice command detection, object/lane detection, weapon detection, food classification, and more across various platforms.

Read our success stories related to Machine Learning expertise to know more about our services for accelerated AI solutions.

Contact us at business@softnautics.com for any queries related to your solution or for consultancy.
