At Glassbeam, we deal with terabytes of data daily. Our SCALAR platform ingests and parses many files in real time. (The files arrive asynchronously and fast!) These files are then picked up for parsing in the order in which they arrive. For such a demanding requirement, we had to design a system which should not only be concurrent and asynchronous but also scalable.
Our customers can send files either through FTP/SFTP/SCP, upload files through UI and so on. We needed a system which could trigger events when files arrive. The file arrival event is then trapped to make some entries in our database for book keeping purposes and further passed on to parsing engine. Our choice was Akka which scales well for this scenario and since we are a predominantly a Scala shop, Akka fits well into the ecosystem. We developed a module and internally we call it “Watcher” using Akka and Linux inotify. There were several challenges implementing such a system although it looks very simple on a first glance. Let’s discuss some of the pitfalls we faced:
Oracle JDK7 WatchService API
The first natural choice to trap file events was to use the JDK7 WATCHSERVICE API. WatchService works by registering a file path on which we expect files to arrive. Once the files start appearing on this path, the API will notify the client code to take action. Internally, this API uses the Linux INOTIFY LIBRARY – the notification system at the filesystem level – and specific to the platform in question. This wasn’t apparent on a first glance. In the development environment everything was working as expected, however on AWS, the notification wasn’t triggering at all.
On closer look it became clear that the Amazon AMI image we were using wasn’t supporting inotify by default. We tried a lot of different AMIs and finally decided on CentOS image which has inotify.
Just when things were going right…
WatchService API’s MODIFY event raised a new issue
For a lot of practical purposes,WATCHSERVICE’S ENTRY_MODIFY EVENT is sufficient for getting a file event. But just getting a file event isn’t enough. The file event should trigger when the file has been completely copied to the watched path. Otherwise, the file may be sent to parsing whilst the file is being copied from the customer end. This is not acceptable and will lead to race conditions. This scenario is especially valid when the files are sent via FTP/SFTP/SCP. When the file is sent via FTP/SFTP/SCP, the files are copied in chunks and WatchService API will treat each chunk as a file and hence trigger as many ENTRY_MODIFY events as the number of chunks. This is the drawback of WatchService API as it cannot let the client code know when the file got completely copied.
The reason that was cited in some of the forums for this seeming limitation is that Oracle JDK aims to achieve portability by letting go some of the platform specific dependency. Thus events such as CLOSE_WRITE which clearly marks the end of file even though the underlying protocol used may be FTP/SFTP/SCP are not supported. Since CLOSE_WRITE event was a vital event, we decided to let go of JDK WatchService API altogether. Instead, we adapted an open source project aimed specifically to address this problem and exposes all the native inotify’s events including CLOSE_WRITE.
We have open sourced this “Watcher” module HTTPS://GITHUB.COM/GLASSBEAM/WATCHER under GPL v3 license.
In conclusion, we realized that although, APIs such as WatchService are designed to be portable, it doesn’t suffice for many scenarios. The API must have had options to enable some of the inotify events thereby using the same API at the discretion of the user. Akka helped to scale and react to events asynchronously and therefore we gained tremendously by not spinning threads ourselves.