Open Source Software

[CloudRAID] 6. Conclusion and Outlook

This is the last post of the series of posts about the student research paper CloudRAID.

The predecessor can be found in Markus Holtermann’s blog.

6. Conclusion and Outlook

CloudRAID logo

Figure 25: CloudRAID logo [Sch12]

CloudRAID is a proof-of-concept implementation of the idea of bringing the RAID concept into the cloud. It shows that distributed cloud storage can be both simple and efficient. CloudRAID implements a full REST API that allows developers to create their own clients. Unfortunately, it still lacks a fully featured user and administration interface for managing user accounts.

Further development could add support for RAID levels other than RAID 5. The creation of privileged users could also be implemented, and support for hierarchical paths for uploaded files is another topic that could be worked on.

Another subject of further development could be a plug-in interface. Software developers could then create OSGi bundles that implement certain interfaces and provide certain services; these bundles would be triggered by certain actions, for example CRUD actions, user creation, or password changes.

A possible plug-in could be an indexing application that creates a search index for all uploaded files. The contents of the backed-up (text) files would then be searchable, which would improve the usability of CloudRAID considerably. Such an indexer could be realized with the well-known Apache Lucene35 library and could be the subject of a further student research project.

Additionally, the availability of the CloudRAID server can be increased by supporting a distributed deployment in which different parts of the application run on different machines. This may require some larger changes in the way bundles communicate with each other, but the current implementation was already developed with this in mind.

Another area for improvement is the client software. Implementing a “sync client” could improve usability considerably. A “sync client” automatically keeps track of file changes on the local file system as well as on the CloudRAID server and downloads or uploads new file versions as needed, without any explicit user interaction.

From the cryptographic perspective, there are also multiple ways to increase the security of CloudRAID. The simplest enhancement would be client-side encryption of files: even if the CloudRAID server is compromised, the data itself remains encrypted. More complex improvements affect the way the meta data is computed. Adding a user ID or another unique key to the hash sums would help prevent attacks on the storage provider side. In addition, the encryption algorithm could be changed from RC4 to AES.
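To make the client-side encryption idea more concrete, here is a minimal sketch of how a client could encrypt a file's bytes with AES before uploading them. It is not part of CloudRAID: the class name ClientSideCrypto, the choice of AES in GCM mode, the 128-bit key and the "IV followed by ciphertext" layout are all assumptions made for illustration.

```java
import java.security.GeneralSecurityException;
import java.security.SecureRandom;
import java.util.Arrays;
import javax.crypto.Cipher;
import javax.crypto.KeyGenerator;
import javax.crypto.SecretKey;
import javax.crypto.spec.GCMParameterSpec;

// Hypothetical helper, not part of CloudRAID.
final class ClientSideCrypto {
    private static final int IV_BYTES = 12;  // commonly recommended nonce size for GCM
    private static final int TAG_BITS = 128; // length of the authentication tag

    // Generates a fresh 128-bit AES key (key storage is out of scope here).
    static SecretKey newKey() {
        try {
            KeyGenerator generator = KeyGenerator.getInstance("AES");
            generator.init(128);
            return generator.generateKey();
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException(e);
        }
    }

    // Returns the random IV followed by the ciphertext (which includes the tag).
    static byte[] encrypt(SecretKey key, byte[] plain) {
        try {
            byte[] iv = new byte[IV_BYTES];
            new SecureRandom().nextBytes(iv);
            Cipher cipher = Cipher.getInstance("AES/GCM/NoPadding");
            cipher.init(Cipher.ENCRYPT_MODE, key, new GCMParameterSpec(TAG_BITS, iv));
            byte[] encrypted = cipher.doFinal(plain);
            byte[] message = Arrays.copyOf(iv, iv.length + encrypted.length);
            System.arraycopy(encrypted, 0, message, iv.length, encrypted.length);
            return message;
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException(e);
        }
    }

    // Reverses encrypt(); fails if the data was modified in between.
    static byte[] decrypt(SecretKey key, byte[] message) {
        try {
            Cipher cipher = Cipher.getInstance("AES/GCM/NoPadding");
            cipher.init(Cipher.DECRYPT_MODE, key,
                    new GCMParameterSpec(TAG_BITS, message, 0, IV_BYTES));
            return cipher.doFinal(message, IV_BYTES, message.length - IV_BYTES);
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

With such a helper the server would only ever see the output of encrypt(), so a compromised server could not read the file contents as long as the key stays on the client.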


35 https://lucene.apache.org/


[Sch12] Conrad Schmidt. CloudRAID Logo, August 2, 2012.


[CloudRAID] 4. Implementation (Continuation)

This post is part of a series of posts about the student research paper CloudRAID.

The predecessor can be found in Markus Holtermann’s blog.

4.3 Compression in RESTful API

To reduce network usage, the RESTful API was implemented to support two different compression standards. Especially for data that is not already stored in a compressed format (unlike, for example, JPEG or PNG image files), the amount of data to be transferred can be reduced drastically.

Most web servers support the compression algorithms gzip29 and deflate30. Both algorithms are free of patents and therefore part of the Java standard libraries.

The client software can announce whether it supports a compression algorithm. If an HTTP request contains a header field indicating a compression algorithm the request’s body must also be compressed using this algorithm.

Since there are different understandings of what the “deflate” keyword means, gzip is the preferred compression option. Although deflate is standardized by the aforementioned RFC, “deflate” in HTTP terms means that data compressed with deflate is sent with additional zlib headers31. Some implementations, however, expect raw deflate data to be sent32. Inconsistent handling of the deflate keyword may therefore lead to problems, which is why the gzip algorithm should be used.

The client has four options regarding the compression:

  1. No header
  2. Accept-Encoding: gzip
  3. Accept-Encoding: deflate
  4. Accept-Encoding: gzip, deflate or Accept-Encoding: deflate, gzip

The first option results in no compression, the second and third in the respective compression algorithm. The fourth option causes the gzip algorithm to be used, regardless of which algorithm is listed first.
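The selection rule for the four header variants can be sketched as follows. The class EncodingSelector and its enum are hypothetical names, not taken from the CloudRAID sources, and the sketch deliberately ignores quality values ("q=") and other tokens that a full HTTP implementation would have to handle.

```java
import java.util.Locale;

// Hypothetical helper illustrating the rule "gzip wins whenever it is offered".
final class EncodingSelector {
    enum Encoding { NONE, GZIP, DEFLATE }

    // acceptEncoding is the raw header value, or null if the header is absent.
    static Encoding select(String acceptEncoding) {
        if (acceptEncoding == null) {
            return Encoding.NONE;
        }
        boolean gzip = false;
        boolean deflate = false;
        for (String token : acceptEncoding.toLowerCase(Locale.ROOT).split(",")) {
            String name = token.trim();
            if (name.equals("gzip")) {
                gzip = true;
            } else if (name.equals("deflate")) {
                deflate = true;
            }
        }
        // The order in the header does not matter: gzip is always preferred.
        if (gzip) {
            return Encoding.GZIP;
        }
        return deflate ? Encoding.DEFLATE : Encoding.NONE;
    }
}
```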

As mentioned above, compression algorithms like gzip can have a positive effect on bandwidth usage and transfer times. In a test, a TeX file of 118 KiB containing plain text was compressed by 68.6% to 37 KiB, whereas a PNG file of 17 KiB was only compressed by 11.7% to 15 KiB. The client software could therefore use compression only where it is worthwhile.
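A client could check whether compression is worthwhile by simply compressing the payload and comparing sizes. The following sketch uses the gzip classes from the Java standard library; the class name GzipSketch and the method worthCompressing() are made up for this example and do not reflect how the CloudRAID client actually decides.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical helper, not part of the CloudRAID client.
final class GzipSketch {
    static byte[] compress(byte[] data) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (GZIPOutputStream gzip = new GZIPOutputStream(buffer)) {
            gzip.write(data);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        // Closing the gzip stream above wrote the trailer, so the buffer is complete.
        return buffer.toByteArray();
    }

    static byte[] decompress(byte[] compressed) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (GZIPInputStream gzip = new GZIPInputStream(new ByteArrayInputStream(compressed))) {
            byte[] chunk = new byte[4096];
            int read;
            while ((read = gzip.read(chunk)) != -1) {
                buffer.write(chunk, 0, read);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    // True if the gzip form is smaller than the original payload.
    static boolean worthCompressing(byte[] data) {
        return compress(data).length < data.length;
    }
}
```

For plain text the compressed form is usually much smaller, while already compressed formats such as PNG gain little, which matches the measurements above.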

If needed, the support of other compression algorithms can easily be implemented for the RESTful API. Either the Java interface IRestApiResponse (listing 12 on page 51) is completely implemented or the Java class PlainApiResponse is extended as in listing 13 on page 51.


Listing 12: IRestApiResponse can be used to implement custom compression support


Listing 13: GZIPPlainApiResponse overrides a single method of PlainApiResponse

4.4 Client Software

As described above the client software providing the user interface can have a very simple architecture. Since every relevant action (regarding splitting, merging and meta data handling) is done by the CloudRAID server application the client simply wraps the RESTful API and provides an intuitive user interface.

The CloudRAID client software is not based on OSGi but is a conventional Java application.

4.4.1 Core

The CloudRAID client software consists of three components. The first component is the Java code that actually wraps the RESTful API. It provides an abstraction that supports all functionality offered by the RESTful API. HTTP error codes are translated into Java exceptions containing the error information as found in the tables above. This component is called core.
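The translation of HTTP status codes into exceptions could look roughly like the sketch below. The exception classes and the mapping of 401 and 404 are assumptions for illustration; the actual codes and exception types of the CloudRAID client are defined by its API tables and sources, which are not reproduced here.

```java
// Hypothetical exception hierarchy; names are not taken from the CloudRAID client.
class ApiException extends RuntimeException {
    private final int status;

    ApiException(int status, String message) {
        super(message);
        this.status = status;
    }

    int status() {
        return status;
    }
}

final class NotAuthorizedException extends ApiException {
    NotAuthorizedException(String message) {
        super(401, message);
    }
}

final class RemoteFileNotFoundException extends ApiException {
    RemoteFileNotFoundException(String message) {
        super(404, message);
    }
}

final class ResponseChecker {
    // Returns normally for 2xx responses and throws a matching exception otherwise.
    static void check(int status, String errorInfo) {
        if (status >= 200 && status < 300) {
            return;
        }
        switch (status) {
            case 401:
                throw new NotAuthorizedException(errorInfo);
            case 404:
                throw new RemoteFileNotFoundException(errorInfo);
            default:
                throw new ApiException(status, errorInfo);
        }
    }
}
```

The user interfaces can then catch these exceptions and display the contained error information in a human readable form.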

Client core

Figure 17: Client core component.

Figure 17 on page 52 shows the architecture of the core component as a UML class diagram. The ServerConnection class stores the information needed to connect to the CloudRAID server. The ServerConnector class uses such a connection to handle the traffic from and to the server. It stores the session after logging in and validates that client and server use the same API version.

It also holds a list of DataPresenters. Every registered implementation of this interface receives the current file list as soon as the ServerConnector has retrieved one from the server. The files stored by CloudRAID are represented by CloudFile objects.
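The relationship between the ServerConnector and its DataPresenters is essentially the observer pattern. The sketch below uses simplified stand-ins for the classes named above; the method names giveFileList(), registerDataPresenter() and onFileListRetrieved() as well as the fields of CloudFile are assumptions and may differ from the real client code.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Simplified stand-in for the client's CloudFile class.
final class CloudFile {
    final String path;

    CloudFile(String path) {
        this.path = path;
    }
}

// Simplified stand-in for the DataPresenter interface.
interface DataPresenter {
    void giveFileList(List<CloudFile> files);
}

// Only the notification part of a ServerConnector is sketched here.
final class ServerConnectorSketch {
    private final List<DataPresenter> presenters = new ArrayList<>();

    void registerDataPresenter(DataPresenter presenter) {
        presenters.add(presenter);
    }

    // Would be called after the file list has been fetched from the server.
    void onFileListRetrieved(List<CloudFile> files) {
        // Presenters get a read-only snapshot so they cannot affect each other.
        List<CloudFile> snapshot = Collections.unmodifiableList(new ArrayList<>(files));
        for (DataPresenter presenter : presenters) {
            presenter.giveFileList(snapshot);
        }
    }
}
```

A GUI table and a CLI printer could both register as presenters and would then be updated by the same call.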

4.4.2 GUI and CLI

The other two parts are user interfaces using the first component. The specific methods are called on user request and the exceptions are shown in a human readable form.

One component is a command line client, the other a graphical client (see Figure 18). While the CLI is only available in English, the GUI supports multiple languages. Currently there are language files for English and German. Customized language files can be created easily.

Screenshot GUI

Figure 18: The graphical CloudRAID client before login.

Screenshot GUI after login

Figure 19: The graphical CloudRAID client after login.


29 https://tools.ietf.org/html/rfc1952

30 https://tools.ietf.org/html/rfc1951

31 https://tools.ietf.org/html/rfc2616

32 http://www.gzip.org/zlib/zlib_faq.html#faq38


[CloudRAID] 3. Concept (Continuation)

This post is part of a series of posts about CloudRAID.

The predecessor can be found here.

The successor can be found in Markus Holtermann’s blog.

3.3.3 Database Design

CloudRAID uses – as described above – an HSQL database for storing meta data of split files and information of user accounts. The database design can be very simple since the server application does not need much information.

The database consists of two tables – one table to store the user accounts, one to store the file metadata (see Figure 14 on page 24).

The cloudraid_users table stores a unique user ID, a unique user name, the encrypted password, and a salt needed to encrypt the password securely.

The cloudraid_files table stores a unique file ID, a path name (which is the file name), a hash of the path name (used to name the files uploaded to the cloud storage services), the date of the last upload of this file, the file’s status, and the user ID of the user the file belongs to (which is a foreign key to the cloudraid_users table’s ID column). The user ID together with the path name is the unique key of this table.

Between the two tables there is an n:1 relationship: n files belong to one user, and a file cannot belong to more than one user.

Database design

Figure 14: Database design of the CloudRAID HSQLMetadataManager.

Core and RESTful need file information, but they should not make any assumptions about what a Java ResultSet looks like. Therefore, all file datasets from cloudraid_files are converted into an object representation. This object representation is defined by the ICloudFile interface in the interfaces bundle (see listing 2 on page 24). The actual implementation of ICloudFile is provided by the respective metadatamanager bundle and may depend on the database or other form of storage.

ICloudFile interface

Listing 2: The ICloudFile interface.

Core and RESTful also need a way to retrieve ICloudFiles, and register new users and files. This is done via the IMetadataManager service. The IMetadataManager is defined as shown in listing 3 on page 25.

IMetadataManager interface

Listing 3: The IMetadataManager interface.

General methods are connect() and disconnect(). They are the first and the last method to be called, respectively. Another general method is initialize(); it is called after connecting to a database to ensure that all relevant database tables exist with the correct constraints and dependencies.

The authUser() method checks whether a user tried to authenticate with the correct credentials. If so, it returns the unique ID of this user, which is then used in other method calls to identify the user; otherwise, it returns -1.

addUser() creates a new user in the database and returns true if the creation was successful.

Most of the file-related methods have self-explanatory names. fileById() gets the file representation of a file whose ID is known while fileGet() returns the file representation when the owner and the path name are known.
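The contract of authUser(), addUser() and fileGet() described above can be illustrated with a small in-memory stand-in. This is only a sketch: the parameter lists are assumptions, fileNew() is a hypothetical method name, files are reduced to their IDs instead of ICloudFile objects, and passwords are kept in plain text purely for brevity, whereas the real implementation stores them encrypted together with a salt in an HSQL database.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical in-memory stand-in; not the HSQL-based implementation.
final class InMemoryMetadataManager {
    private final Map<String, Integer> userIds = new HashMap<>();
    private final Map<String, String> passwords = new HashMap<>();
    private final Map<String, Integer> fileIds = new HashMap<>(); // key: userId + ":" + path
    private int nextUserId = 1;
    private int nextFileId = 1;

    // Returns true if the user was created, false if the name is already taken.
    boolean addUser(String name, String password) {
        if (userIds.containsKey(name)) {
            return false;
        }
        userIds.put(name, nextUserId++);
        passwords.put(name, password); // plain text only to keep the sketch short
        return true;
    }

    // Returns the user's unique ID, or -1 if the credentials are wrong.
    int authUser(String name, String password) {
        Integer id = userIds.get(name);
        if (id == null || !passwords.get(name).equals(password)) {
            return -1;
        }
        return id;
    }

    // Registers a file; (userId, path) is the unique key, so duplicates yield -1.
    int fileNew(String path, int userId) {
        String key = userId + ":" + path;
        if (fileIds.containsKey(key)) {
            return -1;
        }
        fileIds.put(key, nextFileId);
        return nextFileId++;
    }

    // Looks a file up by owner and path name; -1 means "not found".
    int fileGet(String path, int userId) {
        Integer id = fileIds.get(userId + ":" + path);
        return id == null ? -1 : id;
    }
}
```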

3.3.4 Server API

As described in chapter 3.3.1 – Core on page 20, the interface between CloudRAID client and server software is implemented as a RESTful API. The advantages are that it is based on the well-known HTTP protocol and can easily be implemented using OSGi bundles shipped with Equinox.

Thanks to the modularity of the CloudRAID server application, further APIs can be implemented. Another API could be, for example, a WebDAV19 API.

An API implementation only has to obtain the Core bundle’s ICoreAccess service implementation (for ICoreAccess see listing 4 on page 26).

The putData() methods are used to send file data from the API to the Core bundle. The methods getData() and finishGetData() are used to send file data from the Core bundle to the API. deleteData() sends the deletion request for a certain file to the Core bundle. reset() resets the internal state of the ICoreAccess implementation.

ICoreAccess interface

Listing 4: The ICoreAccess interface.

3.3.5 Java Native Interface

An important decision when designing the server application was to implement the RAID functionality in the C programming language and to include it via the Java Native Interface.

To use JNI the developer defines “native” methods in Java classes. These native methods are similar to abstract methods; they also have no function bodies – the bodies are implemented by the external C (or C++) libraries. In a class that defines native methods the C library has to be loaded.

Using the javah command that ships along with the Java Software Development Kit (SDK) a C header file is generated that defines the function signatures for the implementations of the native methods in the Java class.
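As an illustration of the Java side of JNI, the following sketch declares a native method and shows where the library would be loaded. The class name RaidNative, the method signature and the library name "cloudraid" are hypothetical and do not reflect the actual CloudRAID code. Note also that the javah tool was removed in JDK 10; on current JDKs the header is generated with "javac -h" instead.

```java
// Hypothetical example class; names do not come from the CloudRAID sources.
final class RaidNative {
    // Declared in Java, implemented in C. For a class in the default package,
    // the matching C function would be named Java_RaidNative_split.
    static native int split(String inputPath, String outputDir, String key);

    // Loads libcloudraid.so (Unix) or cloudraid.dll (Windows) from
    // java.library.path. It is kept out of a static initializer here so the
    // class can be loaded even when the native library is not installed.
    static void loadLibrary() {
        System.loadLibrary("cloudraid");
    }
}
```

Calling split() before the library has been loaded successfully results in an UnsatisfiedLinkError, which is why loadLibrary() has to run first.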

A big advantage of JNI is that performance-critical parts of the software can run faster, since programs written in C are often much faster, especially regarding hardware access (disk I/O) or handling data at bit level.

The decision to use JNI was based on tests and benchmarks comparing Java, Python, and C implementations of RAID level 5 functionality (see chapter 6.2 – Comparison of Java, Python and C on page 55).

But using JNI also has some disadvantages: The C code may not be platform independent. This means that slightly different implementations must be developed for Unix and Microsoft Windows machines. Additionally, JNI can be a source of memory leaks; the risk of a memory leak can, however, be reduced by careful coding and suitable software tests.

From our point of view, the much higher speed and therefore much shorter runtime outweigh these disadvantages, including the risk of memory leaks.

3.3.6 RAID and Encryption Design

Given that CloudRAID should provide redundancy for the stored data, the various RAID levels introduced in chapter 2.3 – Background on RAID Technology on page 12 were considered. Based on the requirements for high data throughput and a minimum of additional space for redundancy, and on the facts presented in that chapter, we decided to use RAID level 5.

RC4 is used as the underlying cryptographic algorithm. This decision was made because the algorithm is used in well-known environments like the BitTorrent protocol20, the Secure Socket Layer (SSL), and the Microsoft Point-To-Point Encryption (MPPE) protocol21. Besides that, a stream cipher can handle inputs of arbitrary byte length more easily than a block cipher, which would require proper padding of the input.

Furthermore, RC4 was chosen because of its simplicity and speed, both during key setup and during the encryption and decryption calls. As shown in figures 2 and 3 of [PK01] and in all diagrams and tables shown in [NS11], RC4 is much faster and more efficient than a common block cipher like AES, and its speed is independent of the key size.

The cipher key should be enhanced by a salt because this makes the whole cipher key more secure and less predictable. The use of Message Authentication Codes (MACs) or Hash-based Message Authentication Codes (HMACs) can be reasonable and would lead to the desired security.

Encryption and decryption are integrated into the RAID split and merge processes directly after reading the original input file during split and right before writing the merged data to the output file. This simplifies and speeds up both processes considerably. First of all, only one key needs to be managed because split and merge operate on the encrypted data. There is no “part”- or “device”-related encryption. This decreases the runtime, since only one key setup is done before starting the split or merge.

Split process operation flowchart

Figure 15: Split process operation flowchart

The split process of a file uploaded to the CloudRAID service basically reads the file block-wise with twice the internal RAID block size, splits what was read into three parts (each having at most the internal block size), and finally writes the parts to the device files. By continuously changing the position of the parity, the parity is effectively distributed over all devices. Hence the mentioned bottleneck is less likely to occur.
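The core of this step can be sketched in a few lines of Java: two data blocks are combined into a parity block with XOR, and the device that receives the parity rotates with every stripe. The real implementation is written in C; the class Raid5Sketch, the concrete rotation scheme and the assumption that both data blocks have the same length (no padding handling) are simplifications for illustration only.

```java
// Illustrative sketch; the actual CloudRAID RAID 5 code is a C library.
final class Raid5Sketch {
    // Splits one stripe (two equally long data blocks) into three device blocks.
    // The device holding the parity rotates with the stripe index.
    static byte[][] splitStripe(byte[] first, byte[] second, int stripeIndex) {
        int parityDevice = stripeIndex % 3;
        byte[][] devices = new byte[3][];
        devices[parityDevice] = xor(first, second);
        devices[(parityDevice + 1) % 3] = first;
        devices[(parityDevice + 2) % 3] = second;
        return devices;
    }

    // Any single missing block of a stripe is the XOR of the two remaining ones.
    static byte[] recover(byte[] remainingA, byte[] remainingB) {
        return xor(remainingA, remainingB);
    }

    private static byte[] xor(byte[] x, byte[] y) {
        byte[] result = new byte[x.length];
        for (int i = 0; i < x.length; i++) {
            result[i] = (byte) (x[i] ^ y[i]);
        }
        return result;
    }
}
```

Because XOR is its own inverse, the same recover() call works regardless of whether the lost block held data or parity, which is what allows a merge from only two of the three device files.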

The diagram alongside shows the operations of the complete split process. To provide high encryption strength, the split process must generate a random salt. Due to the concept and implementation of RC4, the better the randomness of the salt, the stronger the encryption. The salt is combined with a given key, so that a high level of confidentiality can be achieved.

After the full key has been generated, the input file is read block-wise. Every block that is read is first encrypted and then split. While reading the input file and writing the parts to the device files, the check sums for the three device files and for the input file are computed. Since the Secure Hash Algorithm (SHA)-2 functions support this kind of iterative update of the hashing context, a second pass over the four files is not necessary. This considerably reduces the time needed to complete the split process.
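The iterative update of a hashing context can be demonstrated with the MessageDigest class from the Java standard library. The sketch assumes SHA-256 as the concrete SHA-2 variant, which the text does not specify, and the class StreamingHash is a hypothetical name; the real check sums are computed inside the C library.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Illustrative sketch; SHA-256 is an assumed choice of SHA-2 variant.
final class StreamingHash {
    // Hashes the stream block by block while it is being read, so the data
    // does not have to be read a second time just for the check sum.
    static byte[] sha256WhileReading(InputStream in, int blockSize) {
        try {
            MessageDigest digest = MessageDigest.getInstance("SHA-256");
            byte[] block = new byte[blockSize];
            int read;
            while ((read = in.read(block)) != -1) {
                digest.update(block, 0, read);
                // In the split process the block would be encrypted, split,
                // and written to the device files at this point.
            }
            return digest.digest();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }

    // One-shot hash of a complete byte array, used here only for comparison.
    static byte[] sha256(byte[] data) {
        try {
            return MessageDigest.getInstance("SHA-256").digest(data);
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

Both methods yield the same digest for the same bytes, independent of the block size used while reading.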

The last but no less essential part of the overall split process is depicted in the final process box: writing the meta data. The meta data file contains information about how the data is organized in the device files. Besides the check sums, the salt is kept there.

Merge process operation flowchart

Figure 16: Merge process operation flowchart

The merge process is the reverse of the split process. It takes two or three device files and merges them into the final output file.

In contrast to the split process, the merge process is more complicated and time-consuming at first glance. As one can see in figure 16 on page 29, there are two read accesses to the device files. This inevitably increases the runtime. Nevertheless, both read accesses are necessary in order to provide data integrity checks. The check sums of the device files must be calculated before any actual merge of these files may happen. Comparing the computed hash sums with the hash sums from the given meta data file reveals any inconsistent or broken device files. If more than one hash is incorrect, no successful merge is possible and the merge process must stop. Otherwise, if at least two hash sums are valid, the actual merge starts.
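The decision whether a merge may start can be sketched as a comparison of the computed hash sums with the expected ones from the meta data. The class MergePrecheck and its method are hypothetical; a missing device file is modelled here as a null hash, which is an assumption of this sketch rather than a description of the C implementation.

```java
import java.util.Arrays;

// Illustrative sketch of the integrity check performed before a merge.
final class MergePrecheck {
    // Returns the index of the single broken or missing device file,
    // or -1 if all three hash sums match. Throws if more than one device
    // file is unusable, because RAID 5 can only compensate for one.
    static int findBrokenDevice(byte[][] computedHashes, byte[][] expectedHashes) {
        int broken = -1;
        for (int i = 0; i < expectedHashes.length; i++) {
            // A null computed hash (file missing) never equals the expected hash.
            if (!Arrays.equals(computedHashes[i], expectedHashes[i])) {
                if (broken != -1) {
                    throw new IllegalStateException(
                            "more than one device file is corrupt or missing; merge impossible");
                }
                broken = i;
            }
        }
        return broken;
    }
}
```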

Since the salt used to enhance the key is stored in the meta data, it must be read from there before the complete decryption key can be generated. Afterwards the data is read from the three device files and merged into a single output. This output then has to be decrypted, as it was encrypted before the split.

3.4 Client Architecture

Because of the three-layer architecture of the overall application, the client application can have a very simple architecture. It can consist of two components:

  1. The network component wraps the server’s REST API. It sends requests that may contain files, interprets the HTTP response codes and handles them by throwing appropriate exceptions.
  2. The representation component uses the network component by passing the required parameters from user input and handling the responses and exceptions.


19 Web-based Distributed Authoring and Versioning

20 http://en.wikipedia.org/wiki/BitTorrent_protocol_encryption

21 MPPE: http://tools.ietf.org/html/rfc3078


[NS11] Nidhi Singhal and J. P. S. Raina. Comparative Analysis of AES and RC4 Algorithms for Better Utilization. International Journal of Computer Trends and Technology (IJCTT), 1(3):177 – 181, July – August 2011. http://www.ijcttjournal.org/volume-1/Issue-3/IJCTT-V1I3P107.pdf.

[PK01] P. Prasithsangaree and P. Krishnamurthy. Analysis of Energy Consumption of RC4 and AES Algorithms in Wireless LANs, July 31, 2001. http://www.sis.pitt.edu/~is3966/group5_paper2.pdf, Global Telecommunications Conference, 2003. GLOBECOM ’03.


[CloudRAID] 3. Concept

This post is part of a series of posts about CloudRAID.

The predecessor can be found in Markus Holtermann’s blog and the successor here.

3 Concept

3.1 Requirements

The application has to meet several requirements. First of all, there are data security, data safety, and data availability, which are derived from the weaknesses of cloud storage services (see chapter 2.2 Cloud on page 4).

Since CloudRAID is supposed to be a cloud backup solution there does not have to be a complex synchronization functionality as it is provided by several cloud storage solutions.

3.2 General Architecture

A typical three layer architecture was considered the best solution.

The persistence layer is an aggregate of three cloud storage solutions as a RAID 5 unit.

The application layer is an application running on a server that provides a simple Representational State Transfer (REST) Application Programming Interface (API) to the presentation layer. The connection to the persistence layer is realized via API calls to the cloud storage services’ web service interfaces. On this layer the RAID functionality is implemented and also the encryption of the files. For security reasons the application layer and the presentation layer should communicate via SSL/TLS encrypted lines.

Three layer architecture

Figure 9: three layer architecture

The application layer needs a local14 storage for meta information about files. This information can be used to effectively find RAID5 chunks on different storages. The user may also want to have access to the information of the files’ back-up dates etc.

The presentation layer is a simple Graphical User Interface (GUI), Command Line Interface (CLI) or maybe a website. It wraps the REST API and presents the result of different REST calls to the user.

3.2.1 Advantages

The advantage of this approach is that it is easier to react to changes in the APIs of cloud storage providers. Since one server can handle different clients and their access to the cloud storages, only the server instance has to be updated to the latest API version. The client software does not have to be updated.

Additionally, users can simply access their backups from different end user equipment and from different locations. For this one only needs to establish a connection to the second layer server and not to three cloud storages.

Another advantage is that the user equipment only has to transfer the “normal” amount of data to back up a file. A solution where the client software transfers the data directly to the cloud storage servers would cause about 1.5 times as much data traffic, because RAID 5 across three storages adds one parity block for every two data blocks. This argument mainly matters for mobile devices that have a slow Internet connection and may be charged for consumed traffic.

3.2.2 Disadvantages

The biggest disadvantage of this approach is that there is a single point of failure: If the server application is not available because it crashed or because its Internet connection is broken, it is not possible to access the backed up files. An architecture of the server application that allows using more than one server instance to back up files might overcome this disadvantage.

Every additional node between storage and end user equipment is another point where an attack or security leak can occur. Therefore the server application has to be secured by encrypted data transfer. Additionally, suitable tests and similar measures have to ensure that the server application is secure.

3.3 Server Architecture

A first concept for the server’s general architecture was based on OSGi. Several bundles were defined to achieve good modularity (see Figure 10).

  • Java Native Interface (JNI) RAID contains the logic needed to split files into RAID 5 chunks using an external C library.
  • HSQL Driver contains the HSQL Java DataBase Connectivity (JDBC) driver.
  • DBAccess uses the HSQL Driver to implement classes needed to access a local database to store different data.
  • DBInterface is a single Java interface that defines the methods for the database access. It is implemented by DBAccess. The implementation of this interface is exported as an OSGi service so that the DBAccess package can be replaced by implementations for other databases.
  • PWD Manager contains the logic for password management.
  • PWDInterface is the Java interface that defines the methods for password management so that the concrete implementation can be replaced by other implementations.
  • Jetty is a Java-based web server.
  • Jersey provides REST management for Java.
  • JSON contains JSON-java15 as an OSGi bundle and is used to send server responses to the client in form of JSON-encoded messages.
  • Scribe contains Scribe-java16 as an OSGi bundle and provides the OAuth support needed for different cloud storage APIs.
  • REST Service uses Jetty, Jersey and JSON to provide the interface to the presentation layer.
  • Core combines the different packages and implements the interface to the persistence layer.

Server architecture

Figure 10: Server architecture


Figure 11 on page 19 shows the actual architecture of the server application on OSGi bundle level.

  • Interfaces contains only Java interfaces defining the behavior of different bundles. By implementing an interface of this bundle another bundle is able to offer the interface implementation as an OSGi service.
  • Core statically imports only the interfaces bundle. This is to be able to get the service definitions and dynamically load services at runtime. Core provides the basic functionality for CloudRAID: merging and splitting of files. For this it uses the RAID level 5 implementation by accessing a shared object (Unix) or a DLL (Windows) via JNI.
  • MetadataManager provides the persistence functionality to store meta data of files (hash, status etc.) or user information (name, password). For the standard CloudRAID implementation the MetadataManager imports the HSQL JDBC driver to store the meta data in an HSQL database. Because of the usage of services it is possible to change the MetadataManager implementation (also at runtime) to another implementation, for example providing access to a MySQL, DB2, or Oracle database.
  • Config is the bundle that loads the configuration of the CloudRAID server. In the standard implementation the configuration is read from a partially encrypted XML file. Passwords needed for logging into cloud storage services etc. are stored encrypted. Other parameters are stored plain text. Of course the Config implementation can be dynamically changed so that the configuration is read for example from a database.
  • PasswordManager provides the functionality to get the master password of the CloudRAID server. The password is read at server start-up. The development version of CloudRAID contains a hard coded password. Later versions will use more sophisticated and secure ways.
  • RESTful is the RESTful API. It starts a Hypertext Transfer Protocol (HTTP) server that communicates with the CloudRAID client software. This bundle can also be replaced by other implementations. Possible replacements are a WebDAV interface or an SMB interface.
  • AmazonS3, Dropbox, SugarSync, and UbuntuOne are so-called connectors. They provide specific implementations wrapping the respective cloud storage APIs. By implementing an interface of the interfaces bundle they can export services to the core bundle. Core can then access the cloud storages in a unified way.
  • MiGBase64 (“Mikael Grev Base64”) is an open source (BSD license) and very high performing base64-encoder.17 Since the official implementation is not available as OSGi bundle the project was forked and slightly modified into an OSGi bundle.18

Actual server architecture

Figure 11: Actual server architecture. “A → B” means that A is statically imported by B. Since the RAID implementation is not a real bundle the line is dashed.

The different interface bundles were joined together to one single bundle to reduce the number of bundles. Since interfaces have in general a very small file size using a bundle for every interface would only cause a lot of – administration and file size – overhead.

The Core bundle does not implement the connection functionality to the cloud storages any more, but uses IStorageConnector services to do so (see next section). Additionally, the dependency between the RESTful API and the Core bundle was inverted: In the first design Core should load the API, but in the later design the RESTful API loads the Core bundle.

The configuration functionality was also removed from the Core bundle and moved into its own bundle for higher flexibility.

Jersey and Jetty were replaced by javax.servlet to reduce the number of dependencies that have to be administered. The javax.servlet bundle is already shipped with the Equinox OSGi framework whilst Jersey and Jetty have to be downloaded, built, and installed.

3.3.1 Core

For better comprehensibility, Figure 12 on page 21 shows how the Core bundle interacts with the other bundles. Since the architecture diagram above only shows hard dependencies, this graphic explicitly shows soft dependencies.

The red rectangular boxes with italic text are Java interfaces from the Interfaces bundle. The Core bundle loads different services via the OSGi registry. One is an ICloudRAIDConfig service that is implemented by the Config bundle. The IMetadataManager service is provided by the MetadataManager bundle, and the IPasswordManager service is implemented by the PasswordManager bundle. It is important that Core loads three IStorageConnector services. The respective services are provided by the different storage connectors.

CloudRAID can use the same service three times (which is not the intention of CloudRAID). It can also use the same service twice and another service as the third one (which also negates the advantages of CloudRAID). The best setup is to use three different services (for example Dropbox, UbuntuOne, and AmazonS3). The services used can be controlled via the Config bundle.

The only bundle that is not indirectly loaded by Core via a service is the RESTful bundle. This bundle loads the Core bundle as a service. To make this possible, Core implements the ICoreAccess interface from the Interfaces bundle. The dependency between those two bundles was implemented in this direction because a dependency in the other direction (Core loads RESTful) would make less sense. Core does not have to know the RESTful API or other APIs that provide access to Core, but RESTful needs to know the Core service. This kind of dependency makes it easy to operate two or more APIs at the same time that access the same ICoreAccess service.

Core architecture

Figure 12: Dependencies around core bundle. Empty arrowhead ≙ implementation of interfaces from interfaces bundle; dashed line ≙ soft dependencies via OSGi service

3.3.2 Storage Connectors

Figure 13 on page 23 shows a more detailed view of the storage connectors and their dependencies, since the diagrams above do not show all relevant soft dependencies between the bundles.

Every bundle providing an IStorageConnector service implements the respective interface from the Interfaces bundle. The DropboxConnector, UbuntuOneConnector, and the AmazonS3Connector additionally require third level bundles. Every storage connector loads the Config service using the OSGi registry. Without the Config service the startup of the storage connectors would fail, since they need to know API access tokens, user names, and passwords to be able to log in; for security and portability reasons these should not be hard coded in the bundles.

As shown in listing 1 on page 22, every IStorageConnector service has to implement eight methods. The create() method is called first; it receives the configuration parameters and checks whether all prerequisites are fulfilled.

The connect() method performs the actual login to the cloud storage service and retrieves, for example, access tokens. The disconnect() method is not necessary for the currently supported cloud storages but is intended for services that support logging out.

The upload() method creates a new file on a cloud storage, if and only if (iff) the file is not already on the cloud storage, while the update() method uploads a new file version, iff the file is already on the cloud storage.

get() returns an InputStream that reads a file from a cloud storage while getMetadata() reads file metadata. delete() removes a file from a cloud storage.

IStorageConnector interface

Listing 1: The IStorageConnector interface.
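
Since Listing 1 is only available as an image here, the following is a minimal sketch of what an interface with the eight methods described above might look like. Only the method names are taken from the text; all parameter and return types, the interface name StorageConnectorSketch, the class InMemoryConnector, and the "username" configuration key are assumptions made for illustration and are not CloudRAID's actual signatures. For brevity, upload() and update() take a byte array.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: method names follow the text, all signatures are assumed.
interface StorageConnectorSketch {
    boolean create(Map<String, String> config);  // receive configuration, check prerequisites
    boolean connect();                           // log in, e.g. retrieve access tokens
    boolean disconnect();                        // log out where the service supports it
    boolean upload(String name, byte[] data);    // succeeds only if the file does not exist yet
    boolean update(String name, byte[] data);    // succeeds only if the file already exists
    InputStream get(String name);                // stream of the file content, null if absent
    String getMetadata(String name);             // file metadata, null if absent
    boolean delete(String name);                 // remove the file
}

// In-memory stand-in for a cloud storage, for illustration only.
class InMemoryConnector implements StorageConnectorSketch {
    private final Map<String, byte[]> files = new HashMap<>();
    private String user;
    private boolean connected;

    @Override public boolean create(Map<String, String> config) {
        user = config.get("username");  // hypothetical configuration key
        return user != null;
    }

    @Override public boolean connect() {
        connected = user != null;
        return connected;
    }

    @Override public boolean disconnect() {
        connected = false;
        return true;
    }

    @Override public boolean upload(String name, byte[] data) {
        if (!connected || files.containsKey(name)) return false;
        files.put(name, data.clone());
        return true;
    }

    @Override public boolean update(String name, byte[] data) {
        if (!connected || !files.containsKey(name)) return false;
        files.put(name, data.clone());
        return true;
    }

    @Override public InputStream get(String name) {
        byte[] content = files.get(name);
        return content == null ? null : new ByteArrayInputStream(content);
    }

    @Override public String getMetadata(String name) {
        byte[] content = files.get(name);
        return content == null ? null : "size=" + content.length;
    }

    @Override public boolean delete(String name) {
        return files.remove(name) != null;
    }
}
```

In this sketch a caller first tries upload() and falls back to update() when the file is already present, which mirrors the "iff" semantics described above.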

StorageConnector architecture

Figure 13: Dependencies of storage connectors. Empty arrowhead ≙ implementation of IStorageConnector interface (includes hard dependency); solid line ≙ hard dependency via import; dashed line ≙ soft dependencies via OSGi service


14 from the server’s point of view

15 https://github.com/Markush2010/JSON-java

16 https://github.com/Markush2010/scribe-java

17 http://migbase64.sourceforge.net/

18 https://github.com/Markush2010/MiGBase64

[CloudRAID] 2. Basics

Over the last year a fellow student of mine, Markus Holtermann, and I wrote a student research paper about how to provide availability, redundancy and security of data in the existing, “well known” cloud. In this context we also developed a prototype that we call CloudRAID. The software is licensed under the terms of the Apache License 2.0 and published on GitHub.

During the next weeks we are going to publish our paper as a series of posts on our blogs. This post’s predecessor is 1. Introduction. The next part of the student research paper is published by Markus on his blog and covers some interesting topics like the different types of RAID and encryption. You can find a table of contents (and links to the corresponding blog posts) on Markus’ blog.

2. Basics

2.1 The OSGi Framework

The Java programming language provides two levels of modularity: modularity at the object/class level and modularity at the package level. Within a program, classes can be (re)used in different parts of the application. Packages can be used to give an application a structure for better maintainability.

There are also .jar files. They are .zip files with a different file extension that contain a set of compiled Java classes. If a Java application uses classes from a library, the programmers have to make sure that this specific library is on the classpath on the user’s computer. If this is not the case, the application cannot be executed.

The OSGi framework is a specification for a modular software platform based on the Java Virtual Machine (JVM). Through “bundles” it provides modularity at a higher level than Java packages. [All12]

Every bundle can define dependencies to other bundles via “imports” and can provide programming interfaces to other bundles via “exports”.

Additionally, it is possible to define “services” and register such a service in the OSGi registry. Other bundles look in the registry for a specific service and use it. The advantage of this is that a bundle is not statically bound to a specific other bundle; instead, the service provider can be replaced dynamically by another one.
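
The idea behind such a registry can be modeled in a few lines of plain Java. The sketch below is deliberately not the OSGi API (in a real bundle, registration and lookup go through the framework's BundleContext); ToyRegistry, ConfigService, and MapConfig are hypothetical names chosen for illustration. It only shows that a consumer asks for an interface and never names the providing class.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Toy model of a service registry; NOT the OSGi API.
class ToyRegistry {
    private final Map<Class<?>, Object> services = new ConcurrentHashMap<>();

    // A provider registers an implementation under the interface it offers.
    <T> void register(Class<T> type, T implementation) {
        services.put(type, implementation);
    }

    // A consumer looks up a service by interface; null if no provider is registered.
    <T> T lookup(Class<T> type) {
        return type.cast(services.get(type));
    }
}

// Hypothetical service interface known to both provider and consumer.
interface ConfigService {
    String get(String key);
}

// One possible provider; the consumer never references this class directly.
class MapConfig implements ConfigService {
    private final Map<String, String> values;

    MapConfig(Map<String, String> values) {
        this.values = values;
    }

    @Override public String get(String key) {
        return values.get(key);
    }
}
```

Registering a different ConfigService implementation under the same interface replaces the provider without any change to the consuming code, which is the property the text calls a soft dependency.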

The differences between an OSGi bundle and a simple .jar-file are: [Kon09]

  • Metadata files in OSGi bundles (including versioning).
  • Limited visibility of classes in OSGi bundles.
  • Lifecycle events in OSGi bundles, such as special events at activation or deactivation of a bundle.
  • Services as described above.
  • An OSGi runtime that executes the bundles’ code. OSGi is only the specification of such a runtime; implementations include the Equinox framework by the Eclipse Foundation3 and the Felix framework by the Apache Software Foundation4. There are many other implementations of the OSGi specification, some of them commercial and closed source.

The consequence for software developers is that they do not have to make sure that a specific library is installed on the user’s computer. They only have to require a library that implements a specific functionality. This could mean, for example, that a piece of software is no longer bound to a specific database; instead, the user decides which database is used by installing the corresponding database OSGi bundle.

Hereinafter the term “hard dependency” is used for dependencies that use imports and exports while the term “soft dependency” is used for dependencies realized with OSGi services.

2.2 Cloud

Although many people use the term “cloud” in their daily lives, there are different understandings of what it means. Often the term is used to refer to an abstract place on the Internet where data is stored and processed.

This chapter gives an overview of the most common definitions of the “cloud” and shows issues regarding data-security, data-safety and data-availability.

2.2.1 Types of Cloud

The American National Institute of Standards and Technology (NIST) standardized some cloud-related terms to provide a common understanding when talking about “the cloud”. The usage of a cloud service can be categorized either by the “place” where it is located in a computer network and the type of provider, or by the types of services it provides. [MG11]

There are three terms indicating the different “places” where a cloud is located:

  • Public cloud
  • Private cloud
  • Hybrid cloud

When speaking about the types of services there are three common terms:

  • Infrastructure as a Service (IaaS)
  • Platform as a Service (PaaS)
  • Software as a Service (SaaS)

Public Cloud

A public cloud is provided by an external service provider via the Internet; it is therefore accessible to everyone who has an Internet connection (see also Figure 2). The service provider has several customers that may share the same hardware resources but use different virtual environments. This is a potential security risk if an attacker manages to break out of his or her virtual environment into another one. It is also a risk for data-availability if someone manages to take down a whole hardware resource by accidentally or purposely immobilizing his or her virtual environment.

But especially for business customers a public cloud – as a form of outsourcing – also provides the opportunity to reduce software, hardware and personnel costs. If the cloud service customer hosted the services himself, he would need to pay for hardware, (maybe) software and the personnel that runs and maintains the data center. He also could not react very flexibly to load peaks, for example during Christmas time in the retail industry. For a highly specialized cloud service provider it is much easier to provide enough hardware capacity for peak times. For such a provider, buying new hardware might also be less expensive thanks to volume discounts or special contracts.

Private Cloud

A private cloud is located in a company’s intranet. It is therefore not accessible from the whole Internet but protected by the company’s firewalls and other access restrictions. The operator of a private cloud also has full control over hardware and software resources. This is desirable especially for highly sensitive data such as personal data or proprietary intellectual property.

The disadvantages of a private cloud are higher software, hardware and personnel costs. Running costs also have to be kept in mind: a data center needs a building and consumes energy.

Hybrid Cloud

A hybrid cloud – as the name indicates – is a combination of a private and a public cloud. Its intention is to avoid the disadvantages of both public and private clouds while keeping their advantages.

In a hybrid cloud, sensitive data is only processed in the private cloud while less sensitive or non-sensitive data is sent on demand to an external service provider. The main problem when implementing a hybrid cloud is the (possible) heterogeneity of the private cloud’s and the public cloud’s platforms and software.

Public, private and hybrid cloud.

Figure 2: Graphical comparison of private, public, and hybrid cloud.

Infrastructure as a Service

means that the cloud service provider gives the customers the possibility to use storage space or computing time on demand. The customer may be charged for used storage space or CPU time. This student research project is mainly based on cloud storage services, which can be classified as a form of IaaS.

Platform as a Service

means that the customer gets a development environment in the cloud where he or she can develop and run own applications.

Software as a Service

means that a customer can use software without buying the required licenses. The licenses, or rather the rights to use the software, are sold temporarily (pay-per-use) and the software runs somewhere in the provider’s cloud. In most cases SaaS is provided in the form of web applications using, for example, HTML, JavaScript, and AJAX.

The cloud gives users of cloud services great opportunities to reduce hardware and software costs and to handle performance peaks.

In the following the terms “cloud” and “cloud service” will be used as synonyms for cloud storage solutions. CloudRAID will also deal with the disadvantages of these.

2.2.2 Information Privacy and Information Security

There is no common definition of the term “information privacy”. On the one hand it can mean that the data of a person or a company cannot be accessed by unauthorized persons during electronic data processing. On the other hand it can mean that a person can decide which company is allowed to store his or her personal data and for how long. There are further similar definitions; here we primarily use the first meaning.

Information privacy plays an important role in Germany [Pri07]. There are strict statutory rules for governmental institutions as well as companies that define when and how which personal data may be used in electronic data processing; these rules naturally apply to the cloud, too. Private individuals also worry about the potential loss of personal data in the cloud caused by criminals (better known as “hackers”) breaking into the provider’s systems.

Mainly for companies the question arises whether the advantages of the cloud justify the potential security risks – not only is the company’s intellectual property in danger, there are also severe penalties for violations of the Bundesdatenschutzgesetz (BDSG, the German Federal Data Protection Act), as described in the next section, Statutory Rules (Especially for Germany).

A further problem is caused by the architecture of the Internet: where the data is stored is largely irrelevant for being able to access it from anywhere. The location of a data center does, however, determine whether local authorities can demand access and how high the legal barriers for this are.

The term “information privacy” should not be confused with the term “information security”. The latter describes guaranteeing data integrity, availability and confidentiality. Companies and private users who do not want to lose access to their data or retrieve corrupted or manipulated data are mainly concerned with a high level of information security.

Statutory Rules (Especially for Germany)

In Germany there are different laws that govern how companies and authorities handle personal data. Besides several laws of the 16 German federal states, the federal law BDSG (Federal Data Protection Act) mainly serves to

“[…] den Einzelnen davor zu schützen, dass er durch den Umgang mit seinen personenbezogenen Daten in seinem Persönlichkeitsrecht beeinträchtigt wird.”5

“[…] protect the individual against his or her personal rights being impaired through the handling of his or her personal data.”6

Violations of the BDSG are punished according to § 43 BDSG with fines of up to 50,000 € (§ 43 I) or, in more severe cases, up to 300,000 € (§ 43 II). The same paragraph also states that the fine can exceed those sums if this is appropriate in the individual case (§ 43 III).

The importance of the cloud, but also its data privacy issues, have been recognized by the European Commission, too. Early in 2012 EU Commissioner for Justice Viviane Reding announced a reform of the data privacy regulations at EU level [LB12]. In July of 2012 Neelie Kroes, EU Commissioner for the Digital Agenda, called for more legal certainty for customers of cloud services. Additionally, it should be clarified who can be held liable for loss or theft of personal data in the cloud [Kir12].

Standards and Recommendations

In Germany information security is not as strictly regulated as information privacy. But there are standards by different organizations (such as ISO, EN, DIN etc.) and the IT-Grundschutz-Kataloge (IT Baseline Protection Catalogs) by the Bundesamt für Sicherheit in der Informationstechnik (BSI, Federal Office for Information Security).

The IT-Grundschutz-Kataloge describe the measures to be taken to protect a company’s systems against attacks and outages. They are not a legal requirement but a guideline that can be followed voluntarily. Based on the IT-Grundschutz-Kataloge, the BSI certifies a company’s compliance with ISO/IEC 27001.

ISO/IEC 27001:2005 specifies the requirements for establishing, implementing, operating, monitoring, reviewing, maintaining and improving a documented Information Security Management System within the context of the organization’s overall business risks. [fS08]

In the Cloud

Because the Internet is an international network of computers, special conditions apply regarding statutory access to data and censorship. Additionally, Service Level Agreements (SLAs) and security leaks are very important when evaluating the cloud.

Statutory Access and Censorship

Many countries and their governments fight a war against international terrorism. Especially since 9/11, Western democracies too have been increasing their efforts to establish control structures on the Internet to find potential terrorists and render them harmless. The USA is leading in this fight against terrorism. At the same time, a large number of well-known cloud storage providers have data centers or even their headquarters in the US. There is a danger that US authorities force cloud storage providers to hand over customer data. This also means that data might not be secure on their cloud storages.

This problem exists in other countries, too. Figure 3 compares the information privacy regulations of the member states of the European Union and eleven further states (as of December 31, 2007). Dark red indicates bad conditions; yellow stands for better regulations; green indicates very good conditions. It has to be mentioned that this map does not judge the democratic situation in the respective countries.

Privacy International 2007 privacy ranking map

Figure 3: Privacy International 2007 privacy ranking [Wü08]

Countries like China officially fight against terrorism, too. But the “Great Firewall of China” is indeed an instrument of censorship. One effect of this censorship infrastructure is that most cloud storage providers cannot directly be accessed from China. This also means that a cloud storage user might not be able to access his or her data in the cloud as long as he or she resides in China. Obviously, companies that have subsidiaries in China must be aware of the fact that their network traffic might be analyzed and monitored.

Cloud storage providers themselves also search their storages for potentially illegal or unwanted content (with respect to their terms of service). As the German website netzpolitik.org7 reported, Microsoft scans its cloud storage service SkyDrive for nude pictures, contact data of minors, weapons and other things. User accounts with suspicious activities get blocked and the users banned, although the content may not be illegal. [Mei12]

SLAs and Security Leaks

Cloud storage providers assure their customers – especially their business customers – that their services fulfill certain requirements. This is called a Service Level Agreement (SLA). In most cases an SLA defines that a cloud storage is available for a certain percentage of time. But one must be aware that an availability of 99 % permits 3.65 days of unavailability per year (≈ 88 hours). Especially for business customers such a long outage can be disastrous.
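
The arithmetic behind this figure is easy to reproduce. The following small sketch assumes a 365-day year; the class and method names are chosen for illustration only.

```java
// Yearly downtime that an availability guarantee still permits (365-day year assumed).
class SlaDowntime {
    private static final double HOURS_PER_YEAR = 365 * 24;  // 8760 hours

    // availabilityPercent is the guaranteed availability, e.g. 99.0 or 99.9.
    static double downtimeHoursPerYear(double availabilityPercent) {
        return HOURS_PER_YEAR * (100.0 - availabilityPercent) / 100.0;
    }

    static double downtimeDaysPerYear(double availabilityPercent) {
        return downtimeHoursPerYear(availabilityPercent) / 24.0;
    }
}
```

For 99 % this gives 87.6 hours (3.65 days) per year; each additional "nine" reduces the permitted downtime by a factor of ten, so 99.9 % still allows roughly 8.76 hours per year.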

SLAs also define a maximum recovery time, which states after how many hours or minutes a service will be available again after an outage.

Cloud computing and cloud storage provider Amazon showed in 2011 as well as 2012 that using the cloud does not guarantee failure safety and reliability. In April of 2011 an Amazon data center in Northern Virginia, USA experienced a severe outage – although an availability of 99.9 % was guaranteed. A small amount of data was even lost permanently [Blo11]. In June of 2012 an Amazon data center in Virginia, USA experienced two outages within two weeks. Some well-known Internet sites were unavailable for several hours [Mü12].

Cloud storage provider Dropbox8 showed in 2011 that unauthorized individuals can get access to customer data. For several hours it was possible to log in to arbitrary customer accounts using arbitrary passwords. It is possible that private data was read and copied by unauthorized persons [Bac11].

2.2.3 Assure Data-security

As mentioned above, there may be issues regarding data-security when using cloud storage solutions. Security leaks can occur during the transmission or during the storage of data. In most cases the transmission can be considered safe as long as HTTPS is used. But cloud storage providers may become the target of hacker attacks, or security services of the country the provider is located in may force access to customers’ data.

Cloud storage providers often store files unencrypted for performance9 and storage space10 reasons. These services cannot be seen as suitable places to store private or sensitive data.

A customer of a cloud storage provider can overcome this weakness by using encrypted files or file containers. This means that the user has to install a second piece of software (besides the cloud storage provider’s) on his or her computer. For the average user this approach is too inconvenient.
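
To illustrate what encrypting a file before uploading it involves, here is a minimal sketch using the standard Java cryptography API with AES in GCM mode. It is an assumption-laden example and not the encryption scheme used by CloudRAID; the class name ClientSideEncryption, the 128-bit key size, and the "IV followed by ciphertext" layout are choices made for this sketch. Key storage, which is the hard part in practice, is left out entirely.

```java
import java.security.GeneralSecurityException;
import java.security.SecureRandom;
import java.util.Arrays;
import javax.crypto.Cipher;
import javax.crypto.KeyGenerator;
import javax.crypto.SecretKey;
import javax.crypto.spec.GCMParameterSpec;

// Illustrative client-side encryption; not CloudRAID's actual scheme.
class ClientSideEncryption {
    private static final int IV_BYTES = 12;   // common nonce size for GCM
    private static final int TAG_BITS = 128;  // authentication tag length

    static SecretKey newKey() {
        try {
            KeyGenerator generator = KeyGenerator.getInstance("AES");
            generator.init(128);
            return generator.generateKey();
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException("AES is not available", e);
        }
    }

    // Returns a fresh random IV followed by ciphertext and authentication tag.
    static byte[] encrypt(SecretKey key, byte[] plain) {
        try {
            byte[] iv = new byte[IV_BYTES];
            new SecureRandom().nextBytes(iv);
            Cipher cipher = Cipher.getInstance("AES/GCM/NoPadding");
            cipher.init(Cipher.ENCRYPT_MODE, key, new GCMParameterSpec(TAG_BITS, iv));
            byte[] sealed = cipher.doFinal(plain);
            byte[] out = Arrays.copyOf(iv, IV_BYTES + sealed.length);
            System.arraycopy(sealed, 0, out, IV_BYTES, sealed.length);
            return out;
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException("encryption failed", e);
        }
    }

    // Expects the layout produced by encrypt(); fails if the data was modified.
    static byte[] decrypt(SecretKey key, byte[] ivAndSealed) {
        try {
            Cipher cipher = Cipher.getInstance("AES/GCM/NoPadding");
            cipher.init(Cipher.DECRYPT_MODE, key,
                    new GCMParameterSpec(TAG_BITS, ivAndSealed, 0, IV_BYTES));
            return cipher.doFinal(ivAndSealed, IV_BYTES, ivAndSealed.length - IV_BYTES);
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException("decryption failed or data was tampered with", e);
        }
    }
}
```

Because the provider only ever sees the output of encrypt(), it can neither read the content nor detect that two users uploaded the same file, which is exactly the trade-off the footnotes on performance and storage space point at.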

2.2.4 Assure Data-safety

In general cloud storages are designed in a way that data-safety is ensured via redundancy. Nevertheless, there have been cases in which data was irrecoverably lost (see above).

For an average user this means backing up the already backed-up files a second time in another location – either on another cloud storage or at home on an external hard drive. This again means an inconvenient overhead for him or her.

2.2.5 Assure Data-availability

Data-availability cannot be ensured either. Multiple factors affect whether the data in the cloud can be accessed or not.

If the user’s Internet connection is down, he or she will not be able to access any file in the cloud. On the other hand, the provider’s Internet connection could be down. In this case the user could still access his or her files if they were also stored on another cloud storage.

2.2.6 Conclusion

This chapter showed that there are different issues regarding cloud storage solutions. The biggest problem is that private data is transferred to foreign servers – but their security state is unknown to the user.

A lot of people – potential customers of cloud services – have justified concerns about their privacy and therefore would not upload any files to these services.

This report will show one possible way to tackle the three main problems of cloud storage solutions as described above by using encryption and RAID technology.
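
The RAID levels themselves are covered in the next part of the series. As a rough preview only, and assuming a RAID 5-like layout with two data blocks and one parity block stored on three different cloud storages, the redundancy idea can be sketched with XOR. XorParity is a hypothetical name; this is not CloudRAID's implementation.

```java
// Byte-wise XOR parity over two equally long blocks (illustrative sketch).
class XorParity {
    // The same operation builds the parity block and rebuilds a lost block.
    static byte[] xor(byte[] a, byte[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("blocks must have equal length");
        }
        byte[] out = new byte[a.length];
        for (int i = 0; i < a.length; i++) {
            out[i] = (byte) (a[i] ^ b[i]);
        }
        return out;
    }
}
```

If the parity block is computed as xor(a, b) and the storage holding a becomes unavailable, a can be rebuilt as xor(parity, b). Any single storage may therefore fail without data loss, at the cost of one extra block.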


3 http://www.eclipse.org/equinox/

4 https://felix.apache.org/site/index.html

5 § 1 Abs. 1 BDSG

6 Translation by the author

7 https://netzpolitik.org

8 https://dropbox.com

9 Encryption needs CPU time.

10 Two users may upload the same file. If the files are not encrypted, the provider has to store the content only once.


[All12] OSGi™ Alliance. The OSGi Architecture, 2012. http://www.osgi.org/Technology/WhatIsOSGi (accessed August 21, 2012).

[Bac11] Daniel Bachfeld. Dropbox akzeptierte vier Stunden lang beliebige Passwörter. Heise Online Newsticker, June 21, 2011. http://heise.de/-1264100.

[Blo11] Henry Blodget. Amazon’s Cloud Crash Disaster Permanently Destroyed Many Customers’ Data. Business Insider SAI Online, April 28, 2011. http://www.businessinsider.com/amazon-lost-data-2011-4.

[fS08] International Organization for Standardization. ISO/IEC 27001:2005, October 15, 2008. http://www.iso.org/iso/catalogue_detail?csnumber=42103.

[Kir12] Christian Kirsch. Bericht: EU plant einheitliche Cloud-Regeln. Heise Online Newsticker, June 10, 2012. http://heise.de/-1635545.

[Kon09] Robert Konigsberg. The difference between a jar and a bundle. Blatherberg, April 22, 2009. http://konigsberg.blogspot.co.uk/2009/04/difference-between-jar-and-bundle.html.

[LB12] Falk Lüke and Volker Briegleb. Reding stellt EU-Datenschutzreform vor. Heise Online Newsticker, January 25, 2012. http://heise.de/-1421418.

[Mei12] Andre Meister. SkyDrive: Microsoft durchsucht Nutzer-Daten in der Cloud nach AGB-Verletzungen und sperrt Accounts. Netzpolitik.org, July 20, 2012. https://netzpolitik.org/2012/skydrive-microsoft-durchsucht-nutzer-daten-in-der-cloud-nach-agb-verletzungen-und-sperrt-accounts/.

[MG11] Peter Mell and Timothy Grance. The NIST Definition of Cloud Computing. Recommendations of the National Institute of Standards and Technology, September 2011. http://csrc.nist.gov/publications/nistpubs/800-145/SP800-145.pdf.

[Mü12] Florian Müssig. Weiterer Stromausfall in Amazons Cloud. Heise Online Newsticker, June 30, 2012. http://heise.de/-1629610.

[Pri07] Privacy International. National Privacy Ranking 2007 – Leading Surveillance Societies Around the World, December 31, 2007. https://www.privacyinternational.org/sites/privacyinternational.org/files/file-downloads/phrcomp_sort_0.pdf.

[Wü08] Wüstling. Privacy International 2007 privacy ranking map, June 30, 2008. https://commons.wikimedia.org/w/index.php?title=File:Privacy_International_2007_privacy_ranking_map.png&oldid=61153048. CC-BY-SA 3.0 unported.