Reducing Errors In Data from 10 Million IoT Devices

  • October 28, 2022

A global technology company wanted to improve the quality of its services and gain the ability to debug issues down to a granular level using data from ten million IoT devices (Wi-Fi routers, Wi-Fi networking devices, Wi-Fi photo frames, etc.). The previous system was plagued with data quality and data reliability issues, high data acquisition and ingestion costs, frequent pipeline outages, and high error rates. NTT DATA helped the company significantly reduce error rates and monthly data acquisition costs, completing the work from assessment to deployment in one month.

The Challenges

1. Collecting data from 10 million IoT devices and numerous data types

The devices were independently sending more than 20 different types of real-time data events, and the system was continuously receiving more than 400 events per second. Each device generates this data in two ways:

  1. At pseudorandom times (daily events)
  2. At random times (event-based data)

Examples of these events include:

  • Daily events where the devices send over 300 fields of configuration details
  • Thermal events where the devices send information in an array of varying sizes, depending on the device’s existing sensors, about the temperature of each sensor on the device
  • Kernel reboot events where the devices send basic information about the device, the reboot reason, the device uptime before the reboot, and a detailed kernel log (a hypothetical example of such a record is sketched below)
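
For illustration, a hypothetical kernel reboot record might look like the sketch below. The field names and values are assumptions for this example, not the client's actual schema; as described later in this case, the devices emit these records as multi-line (pretty-printed) JSON.

```python
import json

# Hypothetical kernel reboot event; field names and values are illustrative only.
kernel_reboot_event = {
    "deviceId": "router-0001",
    "eventType": "kernel_reboot",
    "rebootReason": "watchdog_timeout",
    "uptimeSecondsBeforeReboot": 86321,
    "firmwareVersion": "2.4.1",
    "kernelLog": "[    0.000000] Booting Linux on physical CPU 0x0 ...",
}

# Devices send the record as multi-line JSON, which matters later in this story.
print(json.dumps(kernel_reboot_event, indent=2))
```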

2. Sending (and sometimes losing) randomly large datasets  

The previous architecture could not handle the high volume, which created outages and resulted in unreliable analytics. The lead engineer on the project explained: “A huge amount of data is always coming into the system, and we didn’t want to lose the data. If data is lost, it impacts analytics. Consider the number of devices that are online, for example: if someone purchases a device and the device doesn’t reflect online status, that is a bigger concern for the company. And the questions to consider for diagnosing the issue are ‘Why is the customer not using the device?’ ‘Is the customer facing a configuration issue, is the customer not able to use the device, or is the device faulty?’ They don’t want to lose any information or any event. If this data is lost, then the analytics wouldn’t be reliable; this information is used by product engineers to improve the quality of service for the devices.”

3. Geographically dispersed IoT devices across multiple time zones without synchronized clocks

The devices may send a timestamp in the payload, but the timestamps are derived from different time zones.

4. Possibility of generating invalid records 

There is a high possibility of IoT devices sending invalid data, such as partial records caused by power interruptions, non-UTF-8 character sets, or invalid JSON structures. A single corrupted record could cause other records to be dropped, since in many cases a block of records was treated as one inclusive record set.

5. High data acquisition costs

Ingestion costs were high because each IoT message was routed through 20 IoT rules, and a charge was incurred for each rule evaluation, regardless of whether the rule was actually needed.

The Solution

The 10 million IoT devices, deployed by customers all over the globe, send telemetry data to AWS IoT Core using MQTT, an extremely lightweight pub/sub messaging protocol. The telemetry data is then filtered through IoT rules and preprocessed using AWS Lambda. The data is buffered and compressed by Amazon Kinesis Data Firehose and placed into an S3 bucket to stage into the Snowflake Data Cloud.
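
As a rough sketch of how a telemetry event enters this pipeline, the snippet below publishes a hypothetical thermal event to an MQTT topic on AWS IoT Core using the AWS SDK (boto3). The topic structure, payload fields, and region are assumptions for illustration; real devices would use a device-side MQTT client with X.509 certificates rather than SDK credentials.

```python
import json
import boto3

# Publish a hypothetical thermal event to an MQTT topic on AWS IoT Core.
# Topic name and payload fields are illustrative assumptions.
iot_data = boto3.client("iot-data", region_name="us-east-1")

thermal_event = {
    "deviceId": "router-0001",
    "eventType": "thermal",
    "sensorTemperaturesCelsius": [41.2, 39.8, 45.1],  # array size varies per device
    "timestamp": "2022-10-28T12:00:00Z",
}

iot_data.publish(
    topic="devices/router-0001/events/thermal",
    qos=1,
    payload=json.dumps(thermal_event).encode("utf-8"),
)
```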

To address the high ingestion costs, the IoT rules needed to be simplified. At the time of the redesign, there were around 20 different IoT rules, and every message coming from a device ran through all 20 of them; whenever a message hits a rule, whether it passes or fails, a cost is incurred. Jalindar’s team consolidated the 20 rules into a single rule, which saved the client a significant amount in ingestion costs.
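
The consolidated rule itself is not published in the case study, but a minimal sketch of the idea, assuming a wildcard topic structure, a placeholder rule name, and a placeholder Lambda ARN, could look like this:

```python
import boto3

iot = boto3.client("iot", region_name="us-east-1")

# One wildcard topic rule in place of ~20 per-event rules.
# Rule name, topic filter, and the Lambda ARN are placeholders for illustration.
iot.create_topic_rule(
    ruleName="all_device_events",
    topicRulePayload={
        # topic(4) picks the event type out of devices/<id>/events/<type>
        "sql": "SELECT *, topic(4) AS eventType FROM 'devices/+/events/#'",
        "awsIotSqlVersion": "2016-03-23",
        "actions": [
            {
                "lambda": {
                    "functionArn": "arn:aws:lambda:us-east-1:123456789012:function:preprocess-iot-events"
                }
            }
        ],
    },
)
```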

The client previously used Apache NiFi for data flow and processing, but it was causing reliability issues and was costly. The NTT DATA team created an AWS Lambda function to replace the NiFi processing, which also saved money because Lambda charges only for the compute time used, billed by the millisecond. The Lambda function converts the multi-line records into a single line and then sends them on to Amazon Kinesis Data Firehose.
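
The actual Lambda function is not shown in the case study, but a minimal sketch of the approach, assuming the topic rule passes the event payload directly to the handler and using a placeholder delivery stream name, might look like this:

```python
import json
import boto3

firehose = boto3.client("firehose")

# Placeholder name for the Kinesis Data Firehose delivery stream.
DELIVERY_STREAM = "iot-events-to-s3"

def handler(event, context):
    """Collapse one IoT event into a single-line JSON record and hand it to
    Kinesis Data Firehose for buffering and compression."""
    # Compact separators guarantee the record occupies exactly one line,
    # so a corrupt record can never spill into its neighbours downstream.
    single_line = json.dumps(event, separators=(",", ":")) + "\n"

    firehose.put_record(
        DeliveryStreamName=DELIVERY_STREAM,
        Record={"Data": single_line.encode("utf-8")},
    )
    return {"status": "ok"}
```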

For Kinesis Data Firehose, Jalindar’s team configured buffering with both a time limit and a file size limit; whichever is reached first, a file is created and placed into an S3 bucket. According to Jalindar, “The client used to store data in two S3 buckets: one for storing uncompressed data from the IoT rules, and another for the compressed data produced by NiFi. The uncompressed S3 bucket storage cost was removed entirely because we were able to process data directly from the rules to Lambda to Kinesis, buffer it in Kinesis, and then add it into a staging table. Kinesis actually compresses the data into a 10MB file in GZIP format before placing it into the S3 bucket. In production, it takes only a few seconds to build that 10MB file, and then we put it into the S3 bucket, which is used as staging for the Snowflake Data Cloud.”
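
In configuration terms, Firehose buffering hints express exactly this "whichever limit is hit first" behaviour. The sketch below assumes a DirectPut stream, placeholder names and ARNs, and a 300-second interval; only the roughly 10MB file size and GZIP compression are stated in the case study.

```python
import boto3

firehose = boto3.client("firehose", region_name="us-east-1")

# Flush to S3 when either ~10 MB of data or the time limit is reached, whichever comes first.
# Stream name, bucket, role ARN, and the 300-second interval are illustrative assumptions.
firehose.create_delivery_stream(
    DeliveryStreamName="iot-events-to-s3",
    DeliveryStreamType="DirectPut",
    ExtendedS3DestinationConfiguration={
        "RoleARN": "arn:aws:iam::123456789012:role/firehose-delivery-role",
        "BucketARN": "arn:aws:s3:::iot-staging-bucket",
        "Prefix": "staging/",
        "BufferingHints": {"SizeInMBs": 10, "IntervalInSeconds": 300},
        "CompressionFormat": "GZIP",
    },
)
```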

Another key aspect was the error rate of the previous implementation. The NTT DATA team analyzed the error rate, but they could not prevent every error because the devices were sometimes sending bad records. Instead of failing to load data that contained errors, the team started asking, “What are these errors?” Jalindar decided to collect the erroneous data as a separate data source and share it with the client, who can now understand what kind of JSON records are failing, on what type of device they are failing, on what firmware version, and which release is actually producing a higher number of failures.

To accomplish that, the NTT DATA team converted the multi-line JSON into a single-line format. Jalindar explained, “JSON records generated by IoT devices are in a multi-line format, starting with a curly bracket and ending with a curly bracket. What if the ending curly bracket is missed somehow because of some error? In that case, it is very hard to find out where the next record starts, so a few of the records are simply lost that way. What if we place every JSON record on a single line? If something goes wrong, we just skip that particular line. In that way, we can reduce the number of errors, so we implemented that while taking data from the staging bucket into the Snowflake Data Cloud.”
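
A rough sketch of that line-level idea, assuming newline-delimited records read from the staging bucket (record contents and function names are illustrative): valid lines continue on to the analytics tables, while lines that fail to parse are kept aside as the separate error data source described above.

```python
import json

def split_valid_and_invalid(staged_lines):
    """Parse newline-delimited JSON records, keeping records that fail to parse
    as a separate error dataset instead of aborting the whole load."""
    valid_records, error_records = [], []
    for line_number, line in enumerate(staged_lines, start=1):
        line = line.strip()
        if not line:
            continue
        try:
            valid_records.append(json.loads(line))
        except json.JSONDecodeError:
            # Only this line is skipped; surrounding records are unaffected.
            error_records.append({"line": line_number, "raw": line})
    return valid_records, error_records

# A truncated record (missing its closing brace) affects only itself.
staged = [
    '{"deviceId":"router-0001","eventType":"thermal","temps":[41.2,39.8]}',
    '{"deviceId":"router-0002","eventType":"kernel_reboot","reason":"watchdog"',  # corrupt
    '{"deviceId":"router-0003","eventType":"daily_config","fieldCount":300}',
]
good, bad = split_valid_and_invalid(staged)
print(len(good), "records loaded,", len(bad), "kept for error analysis")
```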

The last solution to be highlighted in this case is deployment. It’s not possible to stop 10 million IoT devices during a deployment because the devices keep sending data in real time, so Jalindar’s team needed to switch from the old design to the new design on the fly. “We implemented a parallel flow: we kept the old system running as it was, implemented the new solution, and created a dummy sink into Snowflake for validation. We simply created a copy of the data source and kept it ingesting for a week to validate that our approach was delivering a better result. And it was a great result: costs are much lower, the error rate is reduced to a very high extent, and whatever errors remain, we are able to keep those records in another table for further analysis.”

The Results 

“We ran everything in parallel for seven days, and it was fantastic — everyone was happy. Then we just switched the new flow from the dummy sink over to the original Snowflake analytics and stopped the old flow, so zero downtime was achieved while migrating the system.”

In the previous design, when errors occurred, those records were discarded and the client had no idea where or why the errors were happening. In the new design, if a device sends a bad record, the record containing the error is retained so it can be located and diagnosed.

Consequently, error rates were reduced by 97.6%. Because of the reduced error rate, the client gained significantly more reliable and accurate analytics. With accurate and on-time analytics, the product engineering team was able to use this data for debugging as well as for improving the quality of service for the devices, resulting in large benefits for their IoT device consumers.

The client can now find out exactly which device is producing a bad record or having an issue, and specifically, which firmware or which release version is having a problem. The product engineering team is able to quickly pinpoint issues and provide a fix for that particular area. Previously, finding the exact error was very time-consuming, but now that they are collecting bad record samples, issues are addressed much faster.

Monthly data acquisition and ingestion costs were reduced by 77.3% (from $22,000 to $5,000). Costs were reduced by cutting the number of IoT rules, completely removing the uncompressed S3 bucket, and replacing the NiFi cluster with AWS Lambda serverless functions.

End-to-end solution delivery, from assessment to deployment, took just one month. The NTT DATA team worked quickly to save the client time and give them immediate, cost-saving benefits.
