[CloudRAID] 2. Basics

Over the last year, a fellow student of mine, Markus Holtermann, and I wrote a student research paper on how to provide availability, redundancy and security for data stored in existing, well-known cloud services. In this context we also developed a prototype that we call CloudRAID. The software is licensed under the terms of the Apache 2 License and published on GitHub. Over the next weeks we are going to publish our paper as a series of posts on our blogs. This post’s predecessor is 1. Introduction. The next part of the student research paper is published by Markus on his blog and covers topics such as the different types of RAID and encryption. You can find a table of contents (with links to the corresponding blog posts) on Markus’ blog.

2. Basics

2.1 The OSGi Framework

The Java programming language provides two levels of modularity: the object/class level and the package level. Classes can be (re)used in different parts of an application, and packages give an application a structure that improves maintainability. There are also .jar files, which are .zip files with a different file extension that contain a set of compiled Java classes. If a Java application uses classes from a library, the programmers have to make sure that this library is on the classpath on the user’s computer. Otherwise the application cannot be executed.

The OSGi framework is a specification for a modular software platform based on the Java Virtual Machine (JVM). By means of “bundles” it provides modularity at a coarser level than Java packages. [All12] Every bundle can declare dependencies on other bundles via “imports” and can offer programming interfaces to other bundles via “exports”. Additionally, a bundle can define “services” and register them in the OSGi registry, where other bundles look them up and use them. The advantage is that a bundle is not statically bound to one specific other bundle; the service provider can be replaced dynamically by another one.

An OSGi bundle differs from a simple .jar file in the following points: [Kon09]

  • Metadata files in OSGi bundles (including versioning).
  • Limited visibility of classes in OSGi bundles.
  • Lifecycle events in OSGi bundles, such as special events at activation or deactivation of a bundle.
  • Services as described above.
  • An OSGi runtime that executes the bundles’ code. OSGi is only the specification of such a runtime; implementations include the Equinox framework by the Eclipse Foundation and the Felix framework by the Apache Software Foundation. There are many other implementations of the OSGi specification, some of them commercial and closed source.

The consequence for software developers is that they do not have to check whether a specific library is installed on the user’s computer. They only have to require some library that implements a specific functionality. For example, software is then no longer bound to a specific database; the user can decide which database is used by installing the corresponding database OSGi bundle. Hereinafter the term “hard dependency” is used for dependencies that use imports and exports, while the term “soft dependency” is used for dependencies realized with OSGi services.
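The idea behind a soft dependency can be sketched in plain Java. The following is only a toy model and deliberately does not use the real OSGi API: `ToyServiceRegistry`, `KeyValueStore` and `SoftDependencyDemo` are hypothetical names invented for this illustration. It shows a consumer that looks up a service by its interface and keeps working when the provider is exchanged at runtime.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

// Hypothetical service interface; in OSGi it would live in an exported package.
interface KeyValueStore {
    String name();
}

// Toy stand-in for a service registry (an assumption, NOT the real OSGi API).
class ToyServiceRegistry {
    private final Map<Class<?>, Object> services = new HashMap<>();

    // A provider bundle would do this when it is activated.
    <T> void register(Class<T> type, T implementation) {
        services.put(type, implementation);
    }

    // A provider bundle would do this when it is deactivated.
    void unregister(Class<?> type) {
        services.remove(type);
    }

    // A consumer asks for the interface only and never names the provider.
    <T> Optional<T> lookup(Class<T> type) {
        return Optional.ofNullable(type.cast(services.get(type)));
    }
}

class SoftDependencyDemo {
    // The consumer copes with a missing provider instead of failing at start-up.
    static String describe(ToyServiceRegistry registry) {
        return registry.lookup(KeyValueStore.class)
                .map(store -> "using " + store.name())
                .orElse("no store available");
    }

    public static void main(String[] args) {
        ToyServiceRegistry registry = new ToyServiceRegistry();
        System.out.println(describe(registry));

        registry.register(KeyValueStore.class, () -> "embedded database");
        System.out.println(describe(registry));

        // Exchanging the provider requires no change in the consumer code.
        registry.register(KeyValueStore.class, () -> "remote database");
        System.out.println(describe(registry));
    }
}
```

A hard dependency, in contrast, would correspond to the consumer referring directly to a concrete class from another bundle’s exported package.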

2.2 Cloud

Although many people use the term “cloud” in their daily lives, there are different understandings of what it means. Often the term refers to an abstract place on the Internet where data is stored and processed. This chapter explains the most common definitions of the “cloud” and shows issues regarding data security, data safety and data availability.

2.2.1 Types of Cloud

The American National Institute of Standards and Technology (NIST) standardized some cloud-related terms to provide a common understanding when talking about “the cloud”. A cloud service can be categorized either by the “place” where it is located in a computer network together with the type of provider, or by the type of service it provides. [MG11] Three terms indicate the different “places” where a cloud is located:

  • Public cloud
  • Private cloud
  • Hybrid cloud

When speaking about the types of services there are three common terms:

  • Infrastructure as a Service (IaaS)
  • Platform as a Service (PaaS)
  • Software as a Service (SaaS)

Public Cloud

A public cloud is provided by an external service provider via the Internet; it is therefore accessible to everyone who has an Internet connection (see also Figure 2). The service provider has several customers that may share the same hardware resources but use different virtual environments. This is a potential security risk if an attacker manages to break out of his or her virtual environment into another one. It is also a risk for data availability if someone manages to take down a whole hardware resource by accidentally or purposely immobilizing his or her virtual environment.

Especially for business customers, however, a public cloud (as a form of outsourcing) also offers the opportunity to reduce software, hardware and personnel costs. If cloud service customers hosted the services themselves, they would need to pay for hardware, (possibly) software and the personnel that runs and maintains the data center. They also could not react very flexibly to load peaks, for example at Christmas time in the retail industry. For a highly specialized cloud service provider it is much easier to provide enough hardware capacity for peak times, and buying new hardware might be less expensive for the provider because of volume discounts or special contracts.

Private Cloud

A private cloud is located in a company’s intranet. It is therefore not accessible from the whole Internet but protected by the company’s firewalls and other access restrictions. The operator of a private cloud also has full control over hardware and software resources. This is desirable especially for highly sensitive data such as personal data or proprietary intellectual property. The disadvantages of a private cloud are higher software, hardware and personnel costs. Running costs also have to be kept in mind: a data center needs a building and consumes energy.

Hybrid Cloud

A hybrid cloud, as the name indicates, is a combination of a private and a public cloud. Its intention is to avoid the disadvantages of both public and private clouds while keeping their advantages. In a hybrid cloud, sensitive data is only processed in the private cloud, while less sensitive or non-sensitive data is sent on demand to an external service provider. The main problem when implementing a hybrid cloud is the (possible) heterogeneity of the platforms and software of the private and the public cloud.

Figure 2: Graphical comparison of private, public, and hybrid cloud.

Infrastructure as a Service

IaaS means that the cloud service provider gives customers the possibility to use storage space or computing time on demand. The customer may be charged for the storage space or CPU time used. This student research project is mainly based on cloud storage services, which can be classified as a form of IaaS.

Platform as a Service

PaaS means that the customer gets a development environment in the cloud where he or she can develop and run his or her own applications.

Software as a Service

SaaS means that a customer can use software without buying the required licenses. The licenses, or rather the rights to use the software, are sold temporarily (pay-per-use), and the software runs somewhere in the provider’s cloud. In most cases SaaS is provided in the form of web applications using, for example, HTML, JavaScript, and AJAX.

The cloud gives users of cloud services great opportunities to reduce hardware and software costs and to handle performance peaks. In the following, the terms “cloud” and “cloud service” are used as synonyms for cloud storage solutions. CloudRAID also deals with the disadvantages of these solutions.

2.2.2 Information Privacy and Information Security

There is no common definition of the term “information privacy”. On the one hand it can mean that the data of a person or a company cannot be accessed by unauthorized persons during electronic data processing. On the other hand it can mean that a person can decide which company is allowed to store his or her personal data and for how long. There are further similar definitions; here we primarily refer to the first meaning.

Information privacy plays an important role in Germany [Pri07]. There are strict statutory rules for governmental institutions as well as companies that determine when and how which personal data may be used in electronic data processing; these rules naturally apply to the cloud, too. Private individuals also worry about the potential loss of personal data in the cloud caused by criminals (better known as “hackers”) breaking into the provider’s systems. Mainly for companies the question arises whether the advantages of the cloud justify the potential security risks: not only is the company’s intellectual property in danger, there are also severe penalties for violations of the Bundesdatenschutzgesetz (BDSG, the German Federal Data Protection Act), as described in the next section, Statutory Rules (Especially for Germany).

A further problem is caused by the architecture of the Internet: where data is stored is largely irrelevant for being able to access it from anywhere. The location of a data center does, however, determine whether local authorities can demand access and how high the legal barriers for this are.

The term “information privacy” should not be confused with the term “information security”. The latter describes guaranteeing the integrity, availability and confidentiality of data. Companies and private users who do not want to lose data by no longer being able to access it, or who do not want to retrieve corrupted or manipulated data, are mainly concerned about a high level of information security.

Statutory Rules (Especially for Germany)

In Germany there are different laws that govern how companies and authorities handle personal data. Besides several laws of the 16 German federal states, the federal law BDSG (Federal Data Protection Act) mainly serves to

“[…] den Einzelnen davor zu schützen, dass er durch den Umgang mit seinen personenbezogenen Daten in seinem Persönlichkeitsrecht beeinträchtigt wird.”

“[…] protect the individual so that his or her personal rights are not harmed by the processing of his or her personal data.”

Violations of the BDSG are punished according to § 43 BDSG with fines of up to €50,000 (§ 43 I) or, in more severe cases, up to €300,000 (§ 43 II). The same paragraph also states that the fine can exceed those sums if this is appropriate in the individual case (§ 43 III).

The importance of the cloud, but also its data privacy issues, have been recognized by the European Commission, too. Early in 2012, EU Commissioner for Justice Viviane Reding announced a reform of the data privacy regulations at EU level [LB12]. In July 2012, Neelie Kroes, EU Commissioner for the Digital Agenda, called for more legal certainty for customers of cloud services. Additionally, it should be clarified who can be held liable for loss or theft of personal data in the cloud [Kir12].

Standards and Recommendations

In Germany information security is not as strictly regulated as information privacy. However, there are standards from different organizations (such as ISO, EN, DIN etc.) and the IT-Grundschutz-Kataloge (IT Baseline Protection Catalogs) by the Bundesamt für Sicherheit in der Informationstechnik (BSI, Federal Office for Information Security). The IT-Grundschutz-Kataloge describe the measures to be taken to protect a company’s systems against attacks and outages. They are not a legal requirement but only a guideline that can be followed voluntarily. Based on the IT-Grundschutz-Kataloge, the BSI certifies a company’s compliance with ISO/IEC 27001.

ISO/IEC 27001:2005 specifies the requirements for establishing, implementing, operating, monitoring, reviewing, maintaining and improving a documented Information Security Management System within the context of the organization’s overall business risks. [fS08]

In the Cloud

Because the Internet is an international interconnection of computers, special conditions apply regarding statutory access to data and censorship. Additionally, Service Level Agreements (SLAs) and security leaks are very important when evaluating the cloud.

Statutory Accesses and Censorship

Different countries, or rather their governments, are fighting a war against international terrorism. Especially since 9/11, Western democracies too have been increasing their efforts to establish control structures on the Internet in order to find potential terrorists and render them harmless. The USA is leading this fight against terrorism. However, a large number of well-known cloud storage providers have data centers or even their headquarters in the US. There is the danger that US authorities force cloud storage providers to hand over customer data to them, which means that data might not be secure on these cloud storages. This problem exists in other countries, too. Figure 3 compares the information privacy regulations of the member states of the European Union and eleven further states (as of December 31, 2007). Dark red indicates bad conditions, yellow stands for better regulations, and green indicates very good conditions. It has to be mentioned that this map does not judge the democratic situation in the respective countries.

Figure 3: Privacy International 2007 privacy ranking [Wü08]

Countries like China officially fight against terrorism, too. The “Great Firewall of China”, however, is in fact an instrument of censorship. One effect of this censorship infrastructure is that most cloud storage providers cannot be accessed directly from China. This also means that cloud storage users might not be able to access their data in the cloud as long as they reside in China. Obviously, companies that have subsidiaries in China must be aware that their network traffic might be analyzed and monitored.

Cloud storage providers themselves also search their storages for potentially illegal or unwanted content (as defined in their standard form contracts). As the German website netzpolitik.org reported, Microsoft scans its cloud storage service SkyDrive for nude pictures, contact data of minors, weapons and other things. User accounts with suspicious activities get blocked and the users banned, although the content may not be illegal. [Mei12]

SLAs and Security Leaks

Cloud storage providers assure their customers, especially their business customers, that their services fulfill certain requirements. This is called a Service Level Agreement. In most cases an SLA defines that a cloud storage is available for a certain percentage of time. One must be aware, however, that an availability of 99 % still permits 3.65 days of unavailability per year (which is ≈ 88 hours per year). Especially for business customers such a long outage can be catastrophic. SLAs also define a maximum recovery time, which states after how many hours or minutes a service will be available again after an outage.

Cloud computing and cloud storage provider Amazon showed in 2011 as well as in 2012 that using the cloud does not guarantee failure safety and reliability. In April 2011 the Amazon data center in Dublin, Ireland experienced a complete outage, although an availability of 99.9 % was guaranteed. A small amount of data was even lost permanently [Blo11]. In July 2012 an Amazon data center in Virginia, USA experienced two outages within two weeks. Some well-known Internet sites were unavailable for several hours [Mü12].

Cloud storage provider Dropbox showed in 2011 that unauthorized individuals can get access to customer data. For several hours it was possible to log in to arbitrary customer accounts using arbitrary passwords. It is possible that private data was read and copied by unauthorized persons [Bac11].
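The availability arithmetic can be reproduced with a few lines of Java. This is a minimal sketch; the class and method names are made up for illustration, and a year is assumed to have 365 days (8760 hours).

```java
class DowntimeCalculator {
    // Assumption: a non-leap year of 365 days.
    static final double HOURS_PER_YEAR = 365 * 24;

    // Yearly downtime (in hours) that a given availability percentage still permits.
    static double allowedDowntimeHours(double availabilityPercent) {
        return (100.0 - availabilityPercent) / 100.0 * HOURS_PER_YEAR;
    }

    public static void main(String[] args) {
        for (double availability : new double[] {99.0, 99.9, 99.99}) {
            System.out.printf("%.2f %% availability permits %.2f hours of downtime per year%n",
                    availability, allowedDowntimeHours(availability));
        }
    }
}
```

For 99 % this yields 87.6 hours (3.65 days); each additional “nine” reduces the permitted downtime by a factor of ten, to about 8.76 hours for 99.9 % and about 0.88 hours for 99.99 %.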

2.2.3 Assuring Data Security

As mentioned above, there may be issues regarding data security when using cloud storage solutions. Security leaks can occur during the transmission or during the storage of data. In most cases the transmission can be considered safe as long as HTTPS is used. However, cloud storage providers may be the target of hacker attacks, or security services of the country in which the provider is located may force access to customers’ data. Cloud storage providers often store files unencrypted for performance and storage space reasons. Such services cannot be seen as suitable places to store private or sensitive data. A customer of a cloud storage provider can overcome this weakness by using encrypted files or file containers. For users this means installing a second piece of software (besides the cloud storage provider’s) on their computers. For the average user this approach is too inconvenient.
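Encrypting a file on the client before it is uploaded can be sketched with the cryptography classes of the Java standard library. The example below is only an illustration under stated assumptions: it uses AES in GCM mode with a freshly generated 128-bit key, and the class name `ClientSideEncryption` is hypothetical. It is not necessarily the scheme used by CloudRAID, and it leaves out key management entirely.

```java
import java.nio.charset.StandardCharsets;
import java.security.GeneralSecurityException;
import java.security.SecureRandom;
import java.util.Arrays;
import javax.crypto.Cipher;
import javax.crypto.KeyGenerator;
import javax.crypto.SecretKey;
import javax.crypto.spec.GCMParameterSpec;

class ClientSideEncryption {
    private static final int IV_BYTES = 12;  // commonly recommended IV length for GCM
    private static final int TAG_BITS = 128; // length of the authentication tag

    // Assumption for this sketch: a random AES key that never leaves the client.
    static SecretKey newKey() throws GeneralSecurityException {
        KeyGenerator generator = KeyGenerator.getInstance("AES");
        generator.init(128);
        return generator.generateKey();
    }

    // Returns the random IV followed by the ciphertext; only this is uploaded.
    static byte[] encrypt(SecretKey key, byte[] plaintext) throws GeneralSecurityException {
        byte[] iv = new byte[IV_BYTES];
        new SecureRandom().nextBytes(iv);
        Cipher cipher = Cipher.getInstance("AES/GCM/NoPadding");
        cipher.init(Cipher.ENCRYPT_MODE, key, new GCMParameterSpec(TAG_BITS, iv));
        byte[] ciphertext = cipher.doFinal(plaintext);
        byte[] result = Arrays.copyOf(iv, iv.length + ciphertext.length);
        System.arraycopy(ciphertext, 0, result, iv.length, ciphertext.length);
        return result;
    }

    // Reads the IV from the front and fails if the data was manipulated.
    static byte[] decrypt(SecretKey key, byte[] ivAndCiphertext) throws GeneralSecurityException {
        Cipher cipher = Cipher.getInstance("AES/GCM/NoPadding");
        cipher.init(Cipher.DECRYPT_MODE, key,
                new GCMParameterSpec(TAG_BITS, ivAndCiphertext, 0, IV_BYTES));
        return cipher.doFinal(ivAndCiphertext, IV_BYTES, ivAndCiphertext.length - IV_BYTES);
    }

    public static void main(String[] args) throws GeneralSecurityException {
        SecretKey key = newKey();
        byte[] uploaded = encrypt(key, "private notes".getBytes(StandardCharsets.UTF_8));
        byte[] restored = decrypt(key, uploaded);
        System.out.println(new String(restored, StandardCharsets.UTF_8));
    }
}
```

Because GCM is an authenticated mode, decryption throws an exception when the stored bytes were altered, so the same step also detects corrupted or manipulated data.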

2.2.4 Assuring Data Safety

In general, cloud storages are designed in such a way that data safety is ensured via redundancy. Nevertheless, there have been cases in which data was irrecoverably lost (see above). For average users this means backing up their backed-up files a second time in another location, either on another cloud storage or at home on an external hard drive. This again means inconvenient overhead for them.

2.2.5 Assuring Data Availability

Data availability cannot be guaranteed either. Multiple factors affect whether the data in the cloud can be accessed or not. If the user’s Internet connection is down, he or she will not be able to access any file in the cloud. On the other hand, the provider’s Internet connection could be down. In this case users could still access their files if they had also stored them on another cloud storage.
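The benefit of a second copy on another cloud storage can be illustrated with a small failover sketch. `StorageProvider` and `FailoverDownload` are hypothetical names and do not correspond to any provider’s real API; the sketch simply assumes that an unreachable provider reports an `IOException`.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;
import java.util.Optional;

// Hypothetical abstraction of a cloud storage; real provider APIs look different.
interface StorageProvider {
    byte[] download(String fileName) throws IOException;
}

class FailoverDownload {
    // Tries each provider holding a copy, in order, until one of them answers.
    static Optional<byte[]> downloadFromAny(List<StorageProvider> providers, String fileName) {
        for (StorageProvider provider : providers) {
            try {
                return Optional.of(provider.download(fileName));
            } catch (IOException unreachable) {
                // This provider cannot be reached right now; try the next copy.
            }
        }
        return Optional.empty(); // no copy is reachable, e.g. the user is offline
    }

    public static void main(String[] args) {
        StorageProvider offline = name -> { throw new IOException("provider unreachable"); };
        StorageProvider online = name -> ("contents of " + name).getBytes(StandardCharsets.UTF_8);

        Optional<byte[]> data = downloadFromAny(Arrays.asList(offline, online), "report.pdf");
        System.out.println(data.map(bytes -> new String(bytes, StandardCharsets.UTF_8))
                .orElse("file not available"));
    }
}
```

If the user’s own Internet connection is down, every provider fails and the result is empty, which matches the first case described above.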

2.2.6 Conclusion

This chapter showed that there are different issues regarding cloud storage solutions. The biggest problem is that private data is transferred to third-party servers whose security state is unknown to the user. Many people (potential customers of cloud services) have justified concerns about their privacy and would therefore not upload any files to these services. This report will show one possible way to tackle the three main problems of cloud storage solutions described above by using encryption and RAID technology.