Ceph in pictures

  • Search By Tags

    Type tags separated by commas.
  • Search By Author

Content Type


Forums

  • Community
    • Rules
    • Announcements
    • Request Section
  • Webmaster Tools
    • Scripts
    • Templates
    • Programs
    • Shopify
    • Design
  • DarkNet (TOR ONION)
    • DarkNet Resources
    • DarkNet Services
  • App
    • Android
    • IOS
    • Flutter
    • Unity
    • Windows
  • Webmaster Marketplace
    • Marketplace (Buy, Sell and Trade)
    • Traffic Market (Buy, Sell and Trade)
    • Services
    • Domains
  • IPS
    • Applications And Plugins
    • Styles
  • XenForo 2.x
    • Releases
    • Add-ons
    • Styles
  • Vbulletin 5.x
    • Releases
  • Opencart
    • Templates
    • Modules
  • E-Books, Journals
    • E-Books RU
    • E-Books EN
    • Journals [EN]
  • Crypto World
    • Crypto Crurrencies
  • System Administration
    • Windows
    • Unix
    • MacOS
    • Databases
    • Other OS
    • IT Courses
  • LineAge II Shillien.fun
    • Announcements
    • Clans
    • Talk
    • Market
    • Guide
  • SMM, SEO, Target, Direct
    • SMM, SEO, Target, Direct

Blogs

  • Fox blog
  • Fox blog
  • Чтиво

Categories

  • Files
  • WebMaster Tools
    • Scripts
    • Design
  • Template
  • Programs
    • MacOS
  • Wordpress
    • Themes
    • Plugins
  • Invision Community Suite
    • Applications and Plugins
    • Templates
  • E-Books
    • E-Books RU
    • Journals [EN]
    • Books EN
  • XenForo
    • XenForo Releases
    • XenForo Add-ons
    • XenForo Styles
  • App
    • Android
  • IT Courses

Find results in...

Find results that contain...


Date Created

  • Start

    End


Last Updated

  • Start

    End


Filter by number of...

Joined

  • Start

    End


Group


About Me

Found 1 result

Cloud file storage continues to gain popularity, and the requirements placed on it keep growing. Modern systems can no longer fully satisfy these requirements without spending significant resources on supporting and scaling them. By system I mean a cluster with one or another level of access to the data. For the user, what matters is storage reliability and high availability, so that files can always be retrieved easily and quickly and the risk of data loss tends to zero. For the providers and administrators of such storage, what matters is ease of support, scalability and the low cost of the hardware and software components.

Meet Ceph

Ceph is an open-source, software-defined distributed storage system with no bottlenecks or single points of failure: a cluster of nodes, easily scalable to petabyte sizes, whose nodes perform various functions and provide data storage, replication and load distribution, which guarantees high availability and reliability. The system is free, although the developers can provide paid support. No special hardware is required.

If a disk, a node or a group of nodes fails, Ceph will not only keep the data safe but will also recreate the lost copies on other nodes until the failed nodes or disks are replaced with working ones. The rebuild happens without a second of downtime and is transparent to clients.

Node roles and daemons

Since the system is software-defined and runs on top of standard file systems and network layers, you can take a bunch of different servers, stuff them with disks of different sizes, connect all of this with a network (preferably a fast one) and raise a cluster. You can add a second network card to these servers and connect them with a second network to speed up inter-server data exchange. Experiments with settings and schemes can easily be carried out even in a virtual environment. My experience shows that the longest part of the process is installing the OS. If we have three servers with disks and a configured network, then raising a working cluster with default settings takes 5-10 minutes (assuming everything is done correctly).

Ceph daemons run on top of the operating system and perform the various roles of the cluster. Thus, one server can act, for example, as a monitor (MON) and as data storage (OSD), while another acts as data storage and as a metadata server (MDS). In large clusters the daemons run on separate machines, but in small clusters, where the number of servers is very limited, some servers can perform two or three roles at once, depending on the server capacity and on the roles themselves. Of course, everything will work faster on separate servers, but this is not always possible to implement. A cluster can be assembled even from a single machine with a single disk, and it will work; whether that makes sense is another matter. It should also be noted that, thanks to the software definition, storage can be raised even on top of a RAID or iSCSI device, although in most cases that will not make sense either.

The documentation lists three types of daemons:

  • MON — monitor daemon
  • OSD — storage daemon
  • MDS — metadata server (only required if CephFS is used)

The initial cluster can be created from a few machines, combining cluster roles on them. Then, as the cluster grows and new servers are added, some roles can be duplicated on other machines or moved entirely to separate servers.
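For illustration, here is a minimal sketch of talking to such a cluster from a client machine via the Python librados binding mentioned at the end of the article. It assumes a reachable cluster and the standard /etc/ceph/ceph.conf with an admin keyring; everything else is left at the defaults.

    import rados

    # Connect using the standard config; assumes ceph.conf and an admin keyring
    # are already present on this machine (adjust paths and keys as needed).
    cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
    cluster.connect()

    # Basic cluster-wide information, served by the monitors.
    print('fsid:', cluster.get_fsid())
    print('pools:', cluster.list_pools())
    print('usage:', cluster.get_cluster_stats())  # kb, kb_used, kb_avail, num_objects

    cluster.shutdown()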
Storage structure

To begin with, briefly and confusingly.

A cluster can have one or many data pools, for different purposes and with different settings. Pools are divided into placement groups. Placement groups store the objects that clients access. This is where the logical level ends and the physical level begins, because each placement group is assigned one primary disk and several replica disks (how many exactly depends on the pool's replication factor). In other words, at the logical level an object is stored in a specific placement group, and at the physical level it lives on the disks assigned to that group. The disks themselves can be physically located on different nodes or even in different data centers.

Now in detail, and clearly.

Replication factor (RF)

The replication factor is the level of data redundancy: the number of copies of the data that will be stored on different disks. The size parameter is responsible for it. The replication factor can be different for each pool and can be changed on the fly. In general, almost every parameter in Ceph can be changed on the fly, with the cluster reacting immediately. At first we might have size=2, in which case the pool stores two copies of each piece of data on different disks. The pool parameter can then be changed to size=3, and the cluster will start redistributing data, spreading an extra copy of the existing data across the disks without interrupting the clients' work.

Pool

A pool is a logical, abstract container for organizing the storage of user data. Any data is stored in a pool as objects. Several pools can be spread across the same disks (or different ones, depending on the configuration) using different sets of placement groups. Each pool has a number of configurable parameters: the replication factor, the number of placement groups, the minimum number of live object replicas required for operation, and so on. Each pool can have its own replication policy (by city, data center, rack or even disk). For example, a hosting pool can have a replication factor of size=3 with the data center as the failure zone; Ceph will then guarantee that each piece of data has a copy in each of three data centers. A pool for virtual machines, meanwhile, can have a replication factor of size=2 with the server rack as the failure level, and in that case the cluster stores only two copies. Then, if we have two racks with virtual machine image storage in one data center and two racks in another, the system will pay no attention to the data centers, and both copies of the data may end up in the same data center, but guaranteed in different racks, as we wanted.
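As a small sketch of the "changed on the fly" point: creating a pool and adjusting its size and min_size through the same Python binding. The pool name comes from the hosting example above, and the mon_command JSON is assumed to mirror the argument names of the ceph osd pool set CLI command.

    import json
    import rados

    cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
    cluster.connect()

    # Create the pool if it does not exist yet (the PG count comes from cluster defaults here).
    if not cluster.pool_exists('hosting'):
        cluster.create_pool('hosting')

    # Equivalent of "ceph osd pool set hosting size 3" and "... min_size 2";
    # the cluster starts re-replicating existing data right away, without downtime.
    for var, val in (('size', '3'), ('min_size', '2')):
        cmd = json.dumps({'prefix': 'osd pool set', 'pool': 'hosting',
                          'var': var, 'val': val})
        ret, out, err = cluster.mon_command(cmd, b'')
        print(var, ret, err)

    cluster.shutdown()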
Placement group (PG)

Placement groups are the link between the physical storage level (disks) and the logical organization of the data (pools). Each object at the logical level is stored in a specific placement group. At the physical level it is stored, in the required number of copies, on different physical disks belonging to that placement group (strictly speaking not disks but OSDs, but usually one OSD is one disk, so for simplicity I will call it a disk, though keep in mind there may be a RAID array or an iSCSI device behind it). With a replication factor of size=3, each placement group includes three disks. At the same time, each disk belongs to many placement groups, and for some of them it is the primary, for others a replica.

If an OSD belongs, for example, to three placement groups, then when that OSD fails, the placement groups exclude it from operation, and each of them selects a working OSD in its place and spreads the data onto it. This mechanism achieves a fairly uniform distribution of data and load, and it is a very simple yet flexible solution.

Monitors

A monitor is a daemon that acts as a coordinator and is the starting point of a cluster. As soon as we have at least one working monitor, we have a Ceph cluster. The monitor stores information about the health and state of the cluster, exchanging various maps with the other monitors. Clients contact the monitors to find out which OSDs to write data to or read it from. When deploying a new storage, the first thing to do is create a monitor (or several). A cluster can live on a single monitor, but it is recommended to run 3 or 5 of them, so that the failure of a single monitor does not bring the whole system down. The main thing is that their number be odd, to avoid split-brain situations. The monitors work as a quorum, so if more than half of them fail, the cluster locks up to prevent data inconsistency.

OSD (Object Storage Device)

An OSD is a storage unit that stores the data itself and processes client requests, exchanging data with other OSDs. Usually this is a disk, and usually a separate OSD daemon is responsible for each OSD; the daemon can be launched on any machine where that disk is installed. This is the second thing that needs to be added to the cluster during deployment: one monitor and one OSD are the minimum set needed to raise a cluster and start using it. If a server has 12 disks for storage, 12 OSD daemons will be launched on it. Clients work directly with the OSDs themselves, bypassing any bottlenecks and thereby spreading the load.

The client always writes an object to the primary OSD of a given placement group, and that OSD then synchronizes the data with the remaining (secondary) OSDs of the same placement group. Confirmation of a successful write can be sent to the client immediately after the write to the primary OSD, or after the minimum number of copies has been written (the min_size pool parameter). For example, with a replication factor of size=3 and min_size=2, confirmation of a successful write is sent to the client once the object has been written to at least two of the three OSDs (including the primary).

Different settings of these parameters produce different behavior:

  • If size=3 and min_size=2, everything is fine as long as 2 of the 3 OSDs of a placement group are alive. When only 1 OSD is left alive, the cluster freezes operations on that placement group until at least one more OSD comes back.
  • If size=min_size, the placement group is blocked as soon as any of its OSDs fails. And because data is smeared widely across the cluster, the failure of even one OSD will usually freeze the entire cluster, or almost all of it. Therefore the size parameter should always be at least one greater than min_size.
  • If size=1, the cluster will work, but the death of any OSD means irreversible data loss. Ceph does allow this parameter to be set to one, but even if the administrator does so deliberately and for a short time, he is taking a risk.

An OSD disk consists of two parts: the journal and the data itself. Accordingly, data is written first to the journal and then to the data section. On the one hand, this provides additional reliability and some optimization; on the other, it is an extra operation that affects performance. The question of journal performance is discussed below.
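Before moving on, a small sketch of what this write path looks like from the client's side, using the Python librados binding. The pool and object names are invented for the example; the replication to the secondary OSDs described above happens inside the cluster and is invisible to this code.

    import rados

    cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
    cluster.connect()

    # An I/O context is bound to one pool; CRUSH decides which placement group and
    # which primary OSD actually receive the object, the library just follows the map.
    ioctx = cluster.open_ioctx('hosting')

    ioctx.write_full('object1', b'hello ceph')  # primary OSD replicates to the secondaries
    print(ioctx.read('object1'))                # b'hello ceph'
    print(ioctx.get_stats())                    # per-pool usage counters

    ioctx.close()
    cluster.shutdown()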
The CRUSH algorithm

The decentralization and distribution mechanism is based on the CRUSH algorithm (Controlled Replication Under Scalable Hashing), which plays an important role in the architecture of the system. The algorithm makes it possible to determine the location of an object unambiguously from a hash of the object's name and a specific map, which is built from the physical and logical structure of the cluster (data centers, halls, rows, racks, nodes, disks). The map contains no information about where the data lives: each client determines the path to the data itself, using the CRUSH algorithm and the current map, which it requests from a monitor in advance. When a disk is added or a server fails, the map is updated. Thanks to this determinism, two different clients will independently compute the same unambiguous path to the same object, which frees the system from having to keep all these paths on some set of servers and synchronize them with each other, something that would put a huge extra load on the storage as a whole.

An example: a client wants to write an object, object1, to the pool Pool1. It looks at the placement group map that a monitor kindly provided earlier and sees that Pool1 is divided into 10 placement groups. Then, using the CRUSH algorithm, which takes the object name and the total number of placement groups in Pool1 as input, the placement group ID is calculated. From the map, the client learns that three OSDs are assigned to this placement group (say their numbers are 17, 9 and 22), the first of which is the primary, and that is the one the client will write to. (There are three of them because the replication factor size=3 is set for this pool.) Once the object has been written to OSD_17, the client's work is done (assuming the pool parameter min_size=1), and OSD_17 replicates the object to OSD_9 and OSD_22, the other OSDs assigned to this placement group. It is important to understand that this is a simplified explanation of how the algorithm works.

By default the CRUSH map is flat: all nodes sit in one space. However, this plane can easily be turned into a tree by distributing servers across racks, racks across rows, rows across halls, halls across data centers, and data centers across different cities and planets, and by specifying which level is to be treated as the failure zone. Operating with such a map, Ceph will distribute the data more intelligently, taking the organization's layout into account and averting the tragic consequences of a fire in a data center or a meteorite falling on an entire city. Moreover, thanks to this flexible mechanism, additional layers can be created both at the upper levels (data centers and cities) and at the lower ones (for example, an extra division into groups of disks within a single server).
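To make the determinism easier to see, here is a deliberately simplified toy sketch in Python. This is not the real CRUSH code (Ceph uses its own stable hashing and a weighted bucket hierarchy); it only shows why two independent clients holding the same map arrive at the same placement group and the same primary OSD for object1 from the example above.

    import hashlib

    PG_COUNT = 10                    # Pool1 from the example is split into 10 placement groups
    # Toy "map": every placement group gets an ordered list of OSDs, primary first.
    # In real Ceph this assignment comes from the CRUSH hierarchy, not from a dict.
    PG_MAP = {pg: [(pg * 3 + i) % 24 for i in range(3)] for pg in range(PG_COUNT)}

    def locate(object_name):
        """Deterministically map an object name to a placement group and its OSDs."""
        digest = hashlib.md5(object_name.encode()).digest()
        pg = int.from_bytes(digest[:4], 'little') % PG_COUNT
        return pg, PG_MAP[pg]

    pg, osds = locate('object1')
    print(f"object1 -> pg {pg} -> osds {osds}, primary is osd.{osds[0]}")
    # Any client holding the same map computes exactly the same path, so no central
    # table of object locations has to be stored or kept in sync anywhere.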
Caching

Ceph provides several ways to increase cluster performance using caching.

Primary affinity

Each OSD has several weights, and one of them determines which OSD in a placement group will be the primary. As we found out earlier, the client writes data to the primary OSD. So you can add a bunch of SSDs to the cluster, make them always primary, and reduce the primary-affinity weight of the HDDs to zero. The write will then always land on a fast disk first and only later be replicated, slowly, to the slow ones. This is the least correct of the methods, but the easiest to implement. Its main drawback is that one copy of the data will always sit on an SSD, so a lot of such disks are needed to fully cover the replication. Someone has probably used this method in practice, but I mention it mainly to show that write priority can be managed at all.

Moving journals to SSD

In general, the lion's share of performance depends on the OSD journals. When writing, the daemon first writes the data to the journal and then to the storage itself. This is always true, except when BTRFS is used as the file system on the OSD, which can do both in parallel thanks to copy-on-write, although I am still not convinced it is ready for production use. Each OSD has its own journal, and by default it sits on the same disk as the data. However, the journals of four or five disks can be moved to a single SSD, which speeds up write operations significantly. The method is not very flexible or convenient, but it is quite simple. Its disadvantage is that if the SSD holding the journals dies, we lose several OSDs at once, which is unpleasant and adds difficulties to all further maintenance, and those difficulties grow with the cluster.

Cache tiering

The beauty of this method is its flexibility and scalability. The scheme is that we have a pool with cold data and a pool with hot data. When an object is accessed frequently, it heats up and moves into the hot pool, which consists of fast SSDs; when it cools down, it moves back into the cold pool of slow HDDs. This scheme makes it easy to swap SSDs in the hot pool, and the hot pool can be of any size, because the heating and cooling parameters are adjustable.

From the client's perspective

Ceph gives the client several ways to access data: a block device, a file system or object storage.

Block device (RBD, RADOS Block Device)

Ceph lets you create an RBD block device in a data pool and then mount it on operating systems that support it (at the time of writing, only various Linux distributions, although FreeBSD and VMware are also working in this direction). If the client does not support RBD (Windows, for example), you can use an intermediate iSCSI target with RBD support (tgt-rbd, for example). Such a block device also supports snapshots.

CephFS file system

A client can mount a CephFS file system if it is running Linux with kernel version 2.6.34 or newer. With an older kernel, it can be mounted via FUSE (Filesystem in Userspace). For clients to be able to mount Ceph as a file system, at least one metadata server (MDS) must be running in the cluster.

Object gateway

Using RGW (RADOS Gateway), you can give clients access to the storage via a RESTful API compatible with Amazon S3 or OpenStack Swift.

And others...

All these data access layers work on top of the RADOS layer. The list can be extended by developing your own access layer on top of the librados API (through which the layers above work); at the moment there are bindings for C, Python, Ruby, Java and PHP. RADOS (Reliable Autonomic Distributed Object Store) is, in a nutshell, the layer through which clients interact with the cluster. Wikipedia says that Ceph itself is written in C++ and Python, and that Canonical, CERN, Cisco, Fujitsu, Intel, Red Hat, SanDisk and SUSE take part in its development.
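As an illustration of one of these access layers sitting on top of RADOS, here is a sketch that creates and opens an RBD image through the Python rbd binding. The pool and image names are placeholders, and the client is assumed to have the usual ceph.conf and keyring.

    import rados
    import rbd

    cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
    cluster.connect()
    ioctx = cluster.open_ioctx('rbd')            # pool that will hold the block images

    # Create a 1 GiB image; on a client with the rbd kernel module it could then be
    # mapped as /dev/rbd* and formatted like any other block device.
    rbd.RBD().create(ioctx, 'vm-disk-1', 1024 ** 3)

    with rbd.Image(ioctx, 'vm-disk-1') as image:
        image.write(b'\x00' * 4096, 0)           # raw 4 KiB write at offset 0
        print('image size:', image.size())

    ioctx.close()
    cluster.shutdown()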
Impressions

Why did I write all this and draw the pictures? Because despite all these advantages, Ceph is either not very popular, or people are quietly using it without saying much, judging by the amount of information about it on the Internet. We have seen that Ceph is flexible, simple and convenient: a cluster can be set up on any hardware on an ordinary network with a minimum of time and effort, and Ceph itself will look after the safety of the data, taking the necessary measures when hardware fails. Most opinions agree that Ceph is flexible, simple and scalable; reviews of its performance, however, vary widely. Perhaps someone could not tame the journals, someone else was let down by the network and I/O latency. In other words, making a cluster work is easy, but making it work fast is apparently harder. So I appeal to IT specialists who have experience of running Ceph in production: share your negative impressions in the comments.