Almost a year ago, EMC released VNX2, a new line of midrange storage systems. In this article we give a brief overview of these systems and discuss some of their features.
A move to new hardware always requires software adaptation and optimization. For example, the move to multi-core Intel Xeon E5 (Sandy Bridge) processors led to a substantial rework of the block-level software that runs on the controllers, which is now called MCx. It makes better use of the new processors by distributing tasks across multiple cores more efficiently.
The name MCx comes from MultiCore X, where X stands for the individual components (Cache, Fast Cache, RAID) that now work more efficiently.
Let’s consider the main points:
Multicore Cache no longer requires manually splitting cache memory into separate read and write areas. The cache now adjusts automatically to the changing read/write profile and therefore works more efficiently. Flushing of the write cache has also become more intelligent. Previously the cache page size could be set manually; now the logical block size is fixed at 64KB and the page size at 8KB, so one logical block holds 8 pages. The page size can be checked with the command «naviseccli cache -sp -info».
Multicore RAID is optimized for multi-core configurations. In addition, several new features have appeared:
- Any unused disk in the system can serve as a hot spare. There is also no longer any need to copy data back from the hot spare when a replacement disk is installed, which shortens the whole disk replacement procedure.
- Drive mobility – any disk can be moved to a different slot in the same or another disk shelf. If the disk is not reinstalled within 5 minutes, the system treats it as failed and starts a rebuild to a hot spare. If the disk is installed in a different location in less than 5 minutes, the system only writes the data changes that occurred during this time. This option can be used to free up disk shelves or to balance the load between the buses that connect the disk shelves.
- Parallel rebuild for RAID 6 – two disks in a RAID 6 group can be rebuilt simultaneously, which restores the group faster after a double disk failure.
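The five-minute rule for drive mobility can be illustrated with a small sketch. The function name and the returned labels are hypothetical and chosen only for illustration; the 5-minute window is the only value taken from the text.

```python
REBUILD_TIMEOUT_S = 5 * 60  # five-minute window mentioned in the text


def on_drive_removed(removed_at, reinserted_at=None):
    """Decide what happens to a drive pulled at `removed_at` (seconds).

    `reinserted_at` is the time the same drive shows up again in any slot,
    or None if it never comes back. Labels are illustrative, not EMC terms.
    """
    if reinserted_at is not None and reinserted_at - removed_at < REBUILD_TIMEOUT_S:
        # Drive came back in time: only the changes made meanwhile are written.
        return "resync_changes"
    # Drive is considered failed: rebuild onto a hot spare.
    return "rebuild_to_spare"
```

For example, a drive moved to another shelf within two minutes would only be resynchronized, while a drive missing for longer than five minutes triggers a rebuild.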
Multicore Fast Cache
Fast Cache was already available in the previous VNX line and in CX4. In the new line the technology has been developed further: the algorithms have been improved and the load can now be spread across more cores, which made it possible to increase the size of Fast Cache and make the system more powerful.
The main features of Fast Cache are:
- It is a global system resource.
- In essence, it is a mechanism for building a hybrid cache: it adds a caching level that is much faster than ordinary drives but slower than memory. The access time is on the order of nanoseconds for Multicore Cache, microseconds for Multicore Fast Cache, and milliseconds for ordinary drives.
- It can be enabled per LUN for LUNs in classic RAID groups, or for an entire pool when pools are used.
- To achieve good results, the system has to accumulate both data usage statistics and the data itself in Multicore Fast Cache. This is the so-called warming of Fast Cache.
- It works with logical blocks (chunks) of 64KB. Thus, data blocks get into Fast Cache if their size is 64KB or less.
- The most frequently used data blocks are placed in Fast Cache. A special component, the Policy engine, is responsible for this: it collects statistics about the data blocks and moves data according to built-in policies. These policies cannot be changed.
- A bitmap (the Memory map) holds up-to-date information about the state of the blocks and their addresses. The map is kept in the controllers’ cache, and a copy of it is stored in Fast Cache as well.
- Fast Cache is especially effective when the host frequently reads or writes the same small data blocks (no larger than 64KB). It therefore suits particular load profiles very well, for example OLTP workloads and OLTP databases.
- It is not used for sequential reads or writes; in that case data is read from or written to the disks directly.
- Fast Cache is not used for LUNs created on RAID groups of Flash drives. In a hybrid pool it is likewise not used for the part of a LUN that resides on the “Extreme performance” tier, because that part is already on Flash drives. For the parts of the LUN located on SAS and NL-SAS drives, Fast Cache can be used.
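The role of the Policy engine and the Memory map can be pictured with a toy model. This is only a sketch: the class name, the access counter and, above all, the promotion threshold `PROMOTE_AFTER` are assumptions made for illustration, since the real policies are built in and are not described here. Only the 64KB chunk size comes from the text.

```python
from collections import Counter

CHUNK = 64 * 1024   # Fast Cache chunk size from the text
PROMOTE_AFTER = 3   # ASSUMED threshold, purely illustrative


class PolicyEngine:
    """Toy model: count accesses per 64KB chunk and promote the hot ones."""

    def __init__(self):
        self.hits = Counter()   # chunk number -> number of accesses seen
        self.memory_map = {}    # chunk number -> slot in Fast Cache

    def record_access(self, byte_offset):
        """Register one host access; return True if the chunk is in Fast Cache."""
        chunk = byte_offset // CHUNK
        self.hits[chunk] += 1
        if chunk not in self.memory_map and self.hits[chunk] >= PROMOTE_AFTER:
            # Chunk has become "hot": give it the next free Fast Cache slot.
            self.memory_map[chunk] = len(self.memory_map)
        return chunk in self.memory_map
```

The gradual filling of `memory_map` as accesses accumulate is what the text calls warming of Fast Cache.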
Now let’s describe what happens during a read or write operation when Multicore Fast Cache is used.
A read request from a host is processed as follows:
- A host I/O request is received. Multicore Cache checks whether the request can be serviced from controller memory.
- If the data is in controller memory, it is returned from there (Multicore Cache Read Hit).
- If the data is not in controller memory (Multicore Cache Read Miss), the Memory Map is checked to see whether the requested data is in Multicore Fast Cache.
- If the requested data is in Multicore Fast Cache (Multicore Fast Cache Hit), it is read from there and placed into Multicore Cache, after which Multicore Cache completes the host I/O request.
- If the requested data is not in Multicore Fast Cache (Multicore Fast Cache Miss), it is read from the hard drives and placed into Multicore Cache, after which Multicore Cache completes the host I/O request.
- The Policy engine promotes the data to Multicore Fast Cache and updates the Memory Map if needed.
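The read path above can be sketched as a lookup through three levels. The dictionaries and the function name are hypothetical stand-ins for controller memory, the Memory Map, the Flash drives and the hard drives; the sketch also assumes that data fetched from Fast Cache or from disk is staged in controller memory before the host is answered.

```python
def read_block(addr, dram_cache, fast_cache_map, flash, disks):
    """Return (data, source) for one host read of block `addr`.

    dram_cache     : dict addr -> data   (stand-in for Multicore Cache)
    fast_cache_map : dict addr -> slot   (stand-in for the Memory Map)
    flash          : dict slot -> data   (stand-in for Fast Cache drives)
    disks          : dict addr -> data   (stand-in for SAS / NL-SAS drives)
    """
    if addr in dram_cache:                    # Multicore Cache read hit
        return dram_cache[addr], "multicore_cache"
    if addr in fast_cache_map:                # Multicore Fast Cache hit
        data = flash[fast_cache_map[addr]]
        source = "fast_cache"
    else:                                     # Multicore Fast Cache miss
        data = disks[addr]
        source = "disk"
    dram_cache[addr] = data                   # staged in controller memory (assumption)
    return data, source
```

Promotion by the Policy engine is deliberately left out: it happens asynchronously and does not change what the host receives.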
For write operations two cases are possible: with the controller write cache enabled and with it disabled.
In the first case, the data is written to the controller cache and the host receives an acknowledgement of a successful write. This happens regardless of whether the data is in Multicore Fast Cache or not.
If the write cache is disabled for the system or for the logical volume, the following scheme is used:
- Multicore Cache receives the write request (one block) from the host and holds the data until the host has been told that the write succeeded.
- The bitmap in memory (Memory map) is checked to see whether the requested page is in Multicore Fast Cache.
- If it is, the page is updated in Multicore Fast Cache.
- If it is not, the data is written to the ordinary hard disks.
- Multicore Cache acknowledges the write to the host once the data has been updated either in Multicore Fast Cache or on the ordinary hard disks.
- If the data is used often, the Policy engine copies it into Multicore Fast Cache and updates the bitmap (Memory map).
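The write path with the write cache disabled reduces to a single decision based on the Memory map. A minimal sketch follows; the function name and the dictionaries are hypothetical stand-ins, not part of any EMC interface.

```python
def write_through(addr, data, fast_cache_map, flash, disks):
    """Handle one host write when the controller write cache is disabled.

    Returns the name of the medium that was updated; the host would be
    acknowledged only after this function returns.
    """
    if addr in fast_cache_map:
        # Page is already in Fast Cache: update it there.
        flash[fast_cache_map[addr]] = data
        return "fast_cache"
    # Page is not in Fast Cache: write to the ordinary hard disks.
    disks[addr] = data
    return "disk"
```

Whether the block is later copied into Fast Cache is a separate decision made by the Policy engine from the access statistics.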
Configuration options of Multicore Fast Cache
Different VNX2 models support different numbers of Flash drives for Fast Cache.
Only SLC Flash drives are used for this purpose. SLC drives endure more full rewrite cycles, but they have a smaller capacity than the MLC drives used in Fast VP hybrid pools. The currently available SLC Flash drive sizes are 100GB and 200GB. If you want the maximum number of IOPS from Multicore Fast Cache and want to save a little, use 100GB drives. If there are no special requirements, it is better to use drives of the maximum capacity.
The number of drives supported by the different models is shown in the table below:
For more details about how the cache works, see the EMC document «EMC VNX Multicore Fast Cache VNX2 series detailed review».
Fully automated hybrid pools Fast VP
This technology makes it possible to build pools from different disk types (Flash, SAS, NL-SAS) and to move data between them automatically, depending on how often the data is accessed. Rarely used data ends up on slow disks and frequently used data on faster disks. This approach can substantially reduce the cost of data storage.
The main points to consider when using Fast VP:
- One hybrid pool may have up to three storage tiers.
- The extreme performance tier uses only Flash drives, the performance tier uses SAS drives, and the capacity tier uses NL-SAS drives. SAS disks with different rotational speeds (10K and 15K RPM) are not distinguished within the tier.
- Different RAID group configurations are possible on each tier, and there is a set of optimal configurations (see the table below). One subtlety: suppose you previously used RAID 5 (4+1) groups of 5 disks on the performance tier and urgently need to expand it, but have only 4 free disks. The system will let you add a new 4-disk group in a RAID 5 (3+1) configuration to the tier, but it will warn that this configuration is not recommended and not optimal.
|group type||optimal configurations|
|RAID 5||4+1, 8+1|
|RAID 6||6+2, 14+2|
- The data in these groups is divided into blocks (slices) with a granularity of 256MB (in the previous VNX generation the granularity was 1GB).
- Because data in hybrid pools moves between tiers, it is very important that the pool always has free space available for relocation. Recommendation: leave at least 5% of the pool free.
Moving data between storage tiers
The decision to move data is based on load statistics collected per block (slice). The averaged statistics used to evaluate relocation candidates are produced for every slice of the pool at 1-hour intervals. Building such statistics is not as easy as it seems at first glance. For this purpose the notion of slice “temperature” is introduced: a function that shows how actively a particular slice is used. It is computed from the raw load statistics of the slice (number of I/Os, read/write ratio, response time, and so on). The raw data is filtered with an exponentially weighted moving average (EWMA) to smooth out short-term fluctuations and highlight the main trends.
The smoothed data is then used to calculate the slice “temperature”. The values are normalized, because performance parameters differ between storage tiers, and historical data from earlier periods enters the calculation with decreasing weights: the weight of a sample drops roughly by half over 24 hours. This is only the general picture; anyone who wants to dig deeper can read the EMC Corporation patents US8429346 B1 «Automated data relocation among storage tiers based on storage load» and US8566483 B1 «Measuring data access activity» (inventors: Xiangping Chen, Khang Can, Manish Madhukar, David Harvey, Dean Throop, Mark Ku).
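To make the smoothing step concrete, here is a minimal EWMA sketch. It is not EMC's actual temperature function: the helper names are hypothetical, and reading the weighting remark as a 24-hour half-life over hourly samples is an assumption.

```python
def alpha_for_half_life(periods):
    """Smoothing factor for which a sample's weight halves after `periods` steps."""
    return 1 - 0.5 ** (1 / periods)


def ewma(samples, alpha):
    """Exponentially weighted moving average of a sequence of raw samples."""
    smoothed = []
    current = None
    for x in samples:
        # The first sample seeds the average; later samples are blended in.
        current = x if current is None else alpha * x + (1 - alpha) * current
        smoothed.append(current)
    return smoothed


# Hourly I/O counts of one slice with a short burst in the third hour.
hourly_io = [100, 100, 5000, 100, 100]
trend = ewma(hourly_io, alpha_for_half_life(24))
```

With a small smoothing factor the one-hour burst barely moves the trend, which is exactly why a single spike does not make a slice "hot".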
From this data a table is built that contains the “temperature” of the slices, their addresses and other parameters. The slices are then sorted by “temperature”, which yields a prioritized list of slices to migrate to other tiers, or within the same tier if less loaded resources are available there.
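The ranking step can be sketched as a sort followed by filling the tiers from the fastest one down. This is a simplified illustration under the assumption that capacity is counted in whole slices; the function and tier names are hypothetical, and the real algorithm takes more parameters into account.

```python
def plan_placement(temperatures, tier_capacity):
    """Assign the hottest slices to the fastest tiers.

    temperatures  : dict slice_id -> temperature
    tier_capacity : list of (tier_name, number_of_slices), fastest tier first
    Returns dict slice_id -> tier_name for every slice that fits.
    """
    ranked = sorted(temperatures, key=temperatures.get, reverse=True)
    plan = {}
    start = 0
    for tier, slots in tier_capacity:
        for slice_id in ranked[start:start + slots]:
            plan[slice_id] = tier
        start += slots
    return plan
```

Comparing such a plan with the current location of each slice gives the list of slices that need to be moved.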
All the processes described here consume controller CPU and memory, so they have presumably been implemented with efficiency in mind.
For LUNs in hybrid pools there are policies that determine on which tiers, and in what proportions, the data is initially placed, as well as the rules for moving data blocks on a schedule.
The following policies are available:
- Highest Available Tier – sets the preferred tier for initial data placement and subsequent data movement to the extreme performance tier, subject to available space.
- Auto-Tier – sets the preferred tier for initial data placement to the optimized performance tier and then relocates data based on performance statistics.
- Start High then Auto-Tier – sets the preferred tier for initial data placement to the extreme performance tier, subject to available space, and then relocates data based on performance statistics.
- Lowest Available Tier – sets the preferred tier for initial data placement to the capacity tier, subject to available space, and then relocates data within this tier based on performance statistics.
- No Data Movement – the placement left by the previous policy is kept, and no further data movement is permitted.
The scheduling options are self-explanatory: you can set how often relocation runs, the relocation rate, and the start time and duration of the relocation window. The relocation rate is one of the important parameters, because it determines the additional load on the controllers, and the system must keep serving host access to storage resources whether or not data is being moved.
Thick and thin LUNs
Two types of logical volumes can be created in a hybrid pool, and the difference between them is significant. For thick LUNs the full capacity is allocated at once as a set of 256MB blocks. For thin LUNs the capacity is presented to the host, but the disk space is not actually allocated. When a thin LUN is first created, 1.75GB is consumed from the pool for metadata and a small write area (the number of blocks in the initial allocation of a thin volume may vary with the software version). For subsequent writes, 256MB blocks are allocated, but not contiguously. Metadata and user data of thin LUNs are written with a granularity of 8KB. Thick LUNs use direct addressing of data, while for thin LUNs the addressing is more complicated, and the result is a significant difference in performance. The situation is much better when thin LUNs are used in a pool that contains Flash drives: the metadata can then be stored on a faster tier, which significantly improves thin LUN performance.
Thick LUNs have more predictable performance, because all their 256MB blocks are allocated when the LUN is created. Both types of LUNs may coexist in the same hybrid pool.
When pools are used, a heavy load on one set of LUNs can sometimes degrade the performance of other volumes in the same pool. To avoid this, use QoS Manager, which lets you limit the performance of selected LUNs in the pool. Recommendation: use separate pools for different types of load.
When should we use classic, thick and thin LUNs?
The main thing is to know the requirements of the application and to understand the load profile.
Thin LUNs:
- When savings and ease of administration matter and performance is not important.
- For non-critical systems, development and test environments.
- When compression and deduplication are needed.
Thick LUNs:
- This is the recommended type of volume, which fits most tasks.
- When both performance and ease of administration are needed.
- When a large number of snapshots (VNX Snapshots) is needed.
Classic LUNs (on RAID groups):
- When there are special performance requirements.
- When more predictable performance is needed.
Block deduplication appeared in the new line. It optimizes the storage of data that exists in several copies in the system and is rarely changed. Deduplication is supported for hybrid pools, and only for thin LUNs. Deduplicated LUNs are placed in a special container in the pool, within which identical blocks are searched for. This is post-process deduplication with a fixed block size of 8KB.
Let’s consider how the search for identical data blocks is implemented. In general, it is based on comparing hashes of data blocks. The hashes are calculated with the 64-bit MurmurHash algorithm (a variant of which is also used inside some C++ standard library implementations). The hash function gives an almost one-to-one correspondence between hashes and data blocks. A table of hashes is built and searched for identical hashes. If a match is found, the corresponding data blocks are compared bit by bit. This is necessary because, in rare cases, different blocks can produce the same hash. If the data matches, the addresses of the copies are redirected to a single block and the redundant data is deleted.
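The hash-then-compare scheme can be shown in a few lines. This is a sketch of the general technique, not of the VNX2 implementation: MurmurHash is not part of the Python standard library, so a 64-bit BLAKE2 digest is used as a stand-in, and the function name and data layout are hypothetical.

```python
import hashlib

BLOCK = 8 * 1024  # fixed deduplication block size from the text


def deduplicate(blocks):
    """Deduplicate a list of byte blocks.

    Returns (store, refs): `store` holds each distinct block once and
    refs[i] is the index in `store` that logical block i now points to.
    """
    store = []
    by_digest = {}   # 64-bit digest -> indices of stored blocks with that digest
    refs = []
    for data in blocks:
        # 64-bit digest; BLAKE2 is a stand-in for MurmurHash here.
        digest = hashlib.blake2b(data, digest_size=8).digest()
        for idx in by_digest.get(digest, []):
            if store[idx] == data:       # full comparison guards against collisions
                refs.append(idx)
                break
        else:
            store.append(data)
            by_digest.setdefault(digest, []).append(len(store) - 1)
            refs.append(len(store) - 1)
    return store, refs
```

The inner comparison is the important part: a matching digest only nominates a candidate, and the block is shared only if the contents are really identical.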
However, this process is difficult to implement without additional tricks.
First, some of the structures that speed up work with hashes are probably kept not only in the pool but also in controller memory, and memory is not unlimited.
Second, the larger the table, the longer the search for identical hashes takes, which should have a negative effect on response time.
Third, blocks change fairly often, so some of the hashes in the table have to be removed.
The idea of the optimization is to speed up the comparison by performing it not for the hashes of all blocks but only for those that were used recently, so that the hash table covers a smaller amount of data. This mechanism is described in the EMC patent US8799601 B1 «Techniques for managing deduplication based on recently written extents» (inventors: Xiangping Chen, Philippe Armangau).
Further details of the process are not known exactly. Perhaps something like a hash tree is used: a group of hashes is formed, a separate hash is calculated for that group, and so on, which makes searching through the tree faster. The actual optimization scheme is unlikely to be exactly this; we can only guess, but some of these basic ideas may be used.
The deduplication container is associated with a specific storage controller: the controller that owns the first LUN with deduplication enabled manages the container. The manufacturer recommends that all deduplicated LUNs in a pool be owned by the controller that manages the deduplication container. It is also recommended to distribute deduplication containers evenly between the controllers.
A new deduplication pass runs in the background 12 hours after the start of the previous one. Each pool has its own independent deduplication process, and deduplication is performed only within that pool, i.e. copies of data in different pools are not eliminated.
No more than 3 deduplication processes can run on the same storage controller. If it is time to start a new process but 3 processes for other pools are still running on the controller, the new process will not start until one of them completes.
At the start of a deduplication pass the system checks whether 64GB of new and updated data (including metadata) has accumulated. If the container does not hold that much new data, the deduplication algorithm does not run and the start timer is reset. If the condition is met, the algorithms discussed earlier are run.
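The start conditions for a deduplication pass can be combined into one decision function. The function name, argument names and returned labels are hypothetical; the three thresholds (12 hours, 3 processes per controller, 64GB of new data) are the ones given in the text.

```python
PASS_INTERVAL_H = 12    # hours between passes
MAX_PASSES_PER_SP = 3   # concurrent passes per storage controller
MIN_NEW_DATA_GB = 64    # new or updated data needed to start a pass


def dedup_decision(hours_since_last, running_on_sp, new_data_gb):
    """Return what the scheduler does for one pool's deduplication container."""
    if hours_since_last < PASS_INTERVAL_H:
        return "wait"                    # timer has not expired yet
    if running_on_sp >= MAX_PASSES_PER_SP:
        return "queued"                  # wait for another pool's pass to finish
    if new_data_gb < MIN_NEW_DATA_GB:
        return "skip_and_reset_timer"    # not enough new data to justify a pass
    return "run"
```

The order of the checks follows the description: the timer first, then the per-controller limit, and only then the amount of new data.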
Compression and decompression make heavy use of controller CPU and memory, so compression can be applied to cold data, but with caution. When compression is enabled on a thick LUN in a pool, the LUN is converted to thin. The compression block size is 64KB.
If compression shrinks a 64KB block by less than 8KB (1/8), the data is written uncompressed.
The following compression rate settings are available: low, medium, high.
The manufacturer recommends the low setting.
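The rule about the 8KB minimum saving is easy to express in code. The sketch uses zlib purely as a convenient stand-in, since the compression algorithm used by the array is not named here, and the function name is hypothetical.

```python
import zlib

CHUNK = 64 * 1024       # compression block size from the text
MIN_SAVING = 8 * 1024   # a chunk must shrink by at least 8KB (1/8)


def store_chunk(chunk):
    """Return ("compressed", payload) or ("raw", payload) for one 64KB chunk."""
    packed = zlib.compress(chunk)    # zlib is only a stand-in algorithm
    if len(chunk) - len(packed) >= MIN_SAVING:
        return "compressed", packed
    # Saving is too small to be worth the decompression cost on reads.
    return "raw", chunk
```

A chunk of zeros is stored compressed, while already compressed or random data fails the 8KB test and is written as is.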
Two types of snapshots are supported simultaneously:
- SnapView Snapshots – for LUNs created either in classic RAID groups or in pools. This is the standard solution and uses the copy on first write (COFW) mechanism. Snapshots of this type need a dedicated set of LUNs, located in the Reserved LUN Pool (RLP). When data changes on the source LUN, the original data is first copied to the LUNs of this set, so performance depends heavily on the RLP configuration. Up to 8 writable snapshots per LUN are supported.
- VNX Snapshots – only for LUNs created in pools. They use the redirect on write (ROW) mechanism: when a request to change data on the source LUN arrives, the new data is written to a different location in the pool and the original data is left unchanged. This takes fewer operations than SnapView snapshots, but the data in the pool is not clearly separated, so snapshots and source volumes can interfere with each other. VNX snapshots are not deleted very quickly, and if there are many snapshots that are created and deleted often, this can put additional load on the pool. Up to 256 writable snapshots per LUN are supported.
- Both mechanisms are quite efficient, but careful planning of the pools and the RLP is required.
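The difference between the two mechanisms is easiest to see side by side. The two toy classes below are a sketch of the general COFW and ROW techniques with a single snapshot; the class and attribute names are hypothetical and nothing here reflects the on-disk layout of either product.

```python
class CofwLun:
    """Copy on first write: old data is copied aside before being overwritten."""

    def __init__(self, blocks):
        self.source = list(blocks)   # live LUN, modified in place
        self.reserve = {}            # stand-in for the Reserved LUN Pool

    def write(self, i, data):
        if i not in self.reserve:
            self.reserve[i] = self.source[i]   # extra copy on the first write only
        self.source[i] = data

    def read_snapshot(self, i):
        return self.reserve.get(i, self.source[i])


class RowLun:
    """Redirect on write: new data goes elsewhere, old data stays where it is."""

    def __init__(self, blocks):
        self.pool = list(blocks)                # physical blocks in the pool
        self.live = list(range(len(blocks)))    # live LUN: logical -> physical
        self.snap = list(self.live)             # snapshot keeps the old mapping

    def write(self, i, data):
        self.pool.append(data)                  # simplified: every write is redirected
        self.live[i] = len(self.pool) - 1

    def read(self, i):
        return self.pool[self.live[i]]

    def read_snapshot(self, i):
        return self.pool[self.snap[i]]
```

COFW pays with an extra copy on the first overwrite of each block, while ROW only updates a mapping but leaves live and snapshot data mixed in the same pool.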
The read and write workflows for SnapView and VNX snapshots are shown below:
More information about VNX snapshots can be found in the EMC document «EMC VNX Snapshots».
Remote replication (block level)
Several solutions are available for remote block-level replication between storage systems: MirrorView, RecoverPoint, SAN Copy.
In this review we briefly describe only the basic options of standard MirrorView replication, since RecoverPoint and SAN Copy are rather broad topics.
MirrorView supports two types of replication:
- Synchronous replication (MirrorView/S) – provides an RPO of a few seconds. Over fiber without wavelength-division multiplexing (DWDM) the distance between sites may be up to 60 km; with DWDM it may be up to 200 km.
- Asynchronous replication (MirrorView/A) – provides an RPO of a few hours. The distance between sites may be thousands of miles.
Backend and FrontEnd
As for the hardware, the new systems use PCIe 3.0. The line encoding changed from 8b/10b (PCIe 2.x, overhead of about 20%) to 128b/130b (PCIe 3.0, overhead of about 1.5%). Together with other improvements this gives good backend performance and leaves a reserve for future use. In particular, PCIe 3.0 makes it possible to support speeds of up to 40Gbps over Ethernet and up to 16Gbps over Fibre Channel.
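The two overhead figures follow directly from the encoding ratios. A short check, with a hypothetical helper name:

```python
def encoding_overhead(payload_bits, line_bits):
    """Fraction of transmitted bits that carry no payload."""
    return (line_bits - payload_bits) / line_bits


pcie2 = encoding_overhead(8, 10)      # 8b/10b:    2 of every 10 bits are overhead
pcie3 = encoding_overhead(128, 130)   # 128b/130b: 2 of every 130 bits are overhead
```

So the newer encoding spends 2 bits out of 130 instead of 2 out of 10, which is where the drop from about 20% to about 1.5% comes from.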
Balance between the CPU power of the Controllers and Backend
Storage vendors maintain a balance between the maximum performance of the controllers and the maximum disk configuration (backend) of a storage system. Let’s try to understand how the performance of the processors and of the SAS controller chips relate to each other in EMC VNX2.
To do this, we need parameters that characterize the performance of the processors and of the SAS chips.
We will keep things simple and choose the most basic parameters:
- The total clock frequency of all cores of one controller, as an estimate of processor performance.
- The maximum number of 6Gbps SAS ports (4 lanes each), as an estimate of SAS chip performance.
We take the number and frequency of processor cores and the number of SAS ports from the comparative table at the end of the article. Note that we are interested in the total number of ports on the SAS chips, not in the number of external SAS ports.
Each additional SAS I/O module uses a SAS chip with 4 ports. The module is connected to the PCIe 3.0 fabric with 8 lanes, so the maximum bandwidth in one direction is 6.4 GB/s.
We assume that, for the sake of unification, the same chips are used for the embedded SAS ports on the controller. It follows that the drives of the controller shelf are connected to the SAS chip through 2 internal 6Gbps SAS ports. The controller shelf therefore has a large bandwidth reserve, which is why it is recommended to install Flash drives there. This does not apply to the VNX8000, because it has no embedded SAS ports.
For the other systems we add these 2 internal SAS ports. As a result, we obtain the maximum number of ports on the SAS chips per controller:
VNX5200 – 4
VNX5400 – 4
VNX5600 – 8
VNX5800 – 8
VNX7600 – 8
VNX8000 – 16
After some simple calculations, we can construct the following graphs:
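The calculations behind these graphs can be reproduced from the comparative table at the end of the article and the port counts listed above. The dictionary layout and the function name are only an illustration; the sketch assumes that the core count in the table is per CPU.

```python
MODELS = {
    # model: (CPUs per controller, cores per CPU, clock in GHz, SAS chip ports per controller)
    "VNX5200": (1, 4, 1.2, 4),
    "VNX5400": (1, 4, 1.8, 4),
    "VNX5600": (1, 4, 2.4, 8),
    "VNX5800": (1, 6, 2.0, 8),
    "VNX7600": (1, 8, 2.2, 8),
    "VNX8000": (2, 8, 2.7, 16),
}


def cpu_vs_backend(model):
    """Return (total GHz per controller, GHz per SAS port) for one model."""
    cpus, cores, clock, ports = MODELS[model]
    total_ghz = cpus * cores * clock
    return total_ghz, total_ghz / ports


for name in MODELS:
    total, per_port = cpu_vs_backend(name)
    print(f"{name}: {total:.1f} GHz total, {per_port:.2f} GHz per SAS port")
```

The total frequency per controller is the processor metric used in the discussion, and the per-port ratio shows how much processing power stands behind each backend port.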
Let’s analyze them:
Three backend (SAS chip) configurations correspond to six processor configurations. Thus, processing power and backend performance scale in different steps.
This indicates that some models draw on a backend (SAS chip) performance reserve.
The VNX5200 and VNX5400 use the same SAS chip configuration. Thus, the VNX5200 has a good reserve of backend bandwidth but weaker processors.
The processor configuration of the VNX5400 matches the backend (SAS chip) bandwidth more closely.
Next, we see that for the VNX7600 (17.6 GHz in total per controller) a more than twofold increase in processor power is accompanied by a twofold increase in the backend (8 SAS ports per controller).
The same SAS chip configuration is shared by two more models, the VNX5600 and VNX5800, whose processor performance continues the linear scaling that starts from the VNX5200.
For the VNX8000 (43.2 GHz) a further, more than twofold increase in processing power is accompanied by a doubling of backend performance (16 SAS ports per controller).
Note that the extra CPU performance does not go unused: more complex systems incur more overhead.
We will use this information later to estimate the performance of VNX2 controllers.
Controller performance estimates made with the storage backend calculator can be found in the following articles:
- «IOPS Estimation for EMC VNX5200».
- «IOPS Estimation for EMC VNX5400».
- «IOPS Estimation for EMC VNX7600».
We conclude the review with a small comparative table:
|Model||VNX5200||VNX5400||VNX5600||VNX5800||VNX7600||VNX8000|
|Max drives per array||125||250||500||750||1000||1500|
|Memory per controller, GB||16||16||24||32||64||128|
|Memory per array, GB||32||32||48||64||128||256|
|CPU type||Xeon E5||Xeon E5||Xeon E5||Xeon E5||Xeon E5||Xeon E5|
|CPU per controller||1||1||1||1||1||2|
|Cores per CPU||4||4||4||6||8||8|
|CPU core clock, GHz||1.2||1.8||2.4||2.0||2.2||2.7|
|Max SAS 6Gbps ports for DAE per controller (SP)||2||2||6||6||6||16|
|Max SAS 6Gbps ports for DAE per system||4||4||12||12||12||32|
|Max FC 8Gbps ports per controller (FE)||16||16||20||20||20||36|
|Max FC 8Gbps ports per array (FE)||32||32||40||40||40||72|
|Max iSCSI 10Gbps ports per controller (FE)||8||8||8||8||8||8|
|Max iSCSI 10Gbps ports per array (FE)||16||16||16||16||16||16|
|Max Disks per RAID Group||16||16||16||16||16||16|
|Max Disks per Pool||121||246||496||746||996||1496|