Memory module durability is enhanced through various hardware and software strategies designed to overcome the limitations of certain memory types, such as the limited write endurance of some non-volatile memory (NVM) technologies.
Thermal throttling is a safety feature that protects a drive when it gets too hot. Thermal sensors monitor the operating temperature of critical components. When necessary, a firmware thermal throttling algorithm is activated to keep the temperature from exceeding maximum thresholds by reducing I/O performance until the operating temperature declines to a safe level.
When SSDs take on heavy workloads, or when NVMe SSDs built with high-frequency processors consume more power, the drives generate a great deal of heat. Overheating can damage electronic components and cause system failures.
SMART’s SSDs have a built-in thermal sensor that monitors the temperature of the SSD, and overall performance is throttled once that temperature exceeds the threshold. SMART’s Thermal Throttling architecture offers an algorithm designed to optimize the balance of safety and performance.
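The throttling behavior described above can be illustrated with a minimal sketch. This is not SMART's firmware algorithm. The class name `ThermalThrottle`, the two temperature thresholds, and the 50% throttled rate are all assumptions chosen for illustration, and the gap between the two thresholds (hysteresis) is a common design choice assumed here so that the drive does not flip rapidly between states.

```python
from dataclasses import dataclass


@dataclass
class ThermalThrottle:
    """Toy throttle with hysteresis. All values below are assumed, not vendor specs."""
    throttle_at_c: float = 70.0    # hypothetical: start throttling at or above this
    resume_at_c: float = 60.0      # hypothetical: restore full speed at or below this
    throttled_scale: float = 0.5   # hypothetical: fraction of full I/O rate while throttled
    throttled: bool = False

    def update(self, temperature_c: float) -> float:
        """Take one sensor reading and return the allowed I/O rate (1.0 = full speed)."""
        if temperature_c >= self.throttle_at_c:
            self.throttled = True
        elif temperature_c <= self.resume_at_c:
            self.throttled = False
        # Between the two thresholds the previous state is kept (hysteresis).
        return self.throttled_scale if self.throttled else 1.0


throttle = ThermalThrottle()
readings = [55.0, 68.0, 71.0, 65.0, 59.0]  # made-up sensor readings in Celsius
rates = [throttle.update(t) for t in readings]
print(rates)  # [1.0, 1.0, 0.5, 0.5, 1.0]
```

In this sketch the drive stays throttled at 65 °C because it has not yet cooled to the lower threshold, which mirrors the idea of reducing performance until the temperature declines to a safe level.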
Pseudo single-level cell (pSLC) is a technique for operating multi-level cell (MLC) or triple-level cell (TLC) NAND Flash so that each cell stores only one bit. Storing a single bit per cell increases the reliability and lifetime of the NAND Flash memory.
Many write-intensive applications, such as video surveillance, machine learning and HPC, require a high number of program/erase cycles. SLC NAND Flash offers greater endurance and performance than other NAND Flash types, but at a higher cost per bit.
As NAND technology is increasingly driven by cost per gigabyte, 3D TLC NAND provides an excellent balance of cost and product lifetime. SMART’s pseudo-SLC (pSLC) products deliver improved endurance and performance in applications that require a higher level of data integrity and a longer expected product lifetime. By implementing NAND vendor-specific commands and algorithms in the Flash controller firmware, 3D TLC NAND can be operated in 1-bit pSLC mode, where endurance and performance behaviors are in line with native SLC NAND. SMART’s pSLC technology is designed to optimize the balance between cost and performance.
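One consequence of storing a single bit per cell is a smaller usable capacity, which is part of the cost side of the balance. The sketch below works through that arithmetic. The function name `pslc_capacity_gb` and the 240GB figure are illustrative assumptions, and the calculation ignores any controller or firmware overhead.

```python
def pslc_capacity_gb(native_capacity_gb: float, native_bits_per_cell: int) -> float:
    """Capacity when each cell stores 1 bit instead of native_bits_per_cell bits."""
    if native_bits_per_cell < 1:
        raise ValueError("native_bits_per_cell must be at least 1")
    return native_capacity_gb / native_bits_per_cell


# Hypothetical 240GB device built from TLC (3 bits per cell) or MLC (2 bits per cell).
tlc_as_pslc = pslc_capacity_gb(240, 3)
mlc_as_pslc = pslc_capacity_gb(240, 2)
print(tlc_as_pslc, mlc_as_pslc)  # 80.0 120.0
```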
Wear-Leveling refers to the practice of ensuring that certain NAND blocks aren’t written and erased more often than others. By preventing the overuse of particular blocks, which could lead to device failure or data loss, Wear-Leveling improves the life expectancy and endurance of Flash products.
NAND Flash supports only a limited number of program/erase (P/E) cycles. When a block reaches its maximum P/E cycle count, it becomes unusable. If certain blocks are written and erased disproportionately often, their P/E cycles are consumed rapidly, causing the NAND Flash to fail early.
The principle of Wear-Leveling is to have all blocks receive roughly the same number of writes, avoiding repeated P/E cycles on the same blocks. The Wear-Leveling algorithm is typically managed by the Flash controller, which determines which physical block to use each time data is programmed. Wear-Leveling can enhance the reliability and life of the storage device, particularly in enterprise system applications.
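A minimal sketch of the block-selection idea follows: each time data is programmed, pick the free block with the fewest erases so far. The `WearLeveler` class and its methods are hypothetical names, and the policy shown (often called dynamic wear-leveling) is one simple approach, not necessarily the one any particular controller uses.

```python
class WearLeveler:
    """Toy allocator that always hands out the least-worn free block."""

    def __init__(self, num_blocks: int) -> None:
        self.erase_counts = [0] * num_blocks
        self.free_blocks = set(range(num_blocks))

    def allocate(self) -> int:
        """Return the free block with the lowest erase count (ties: lowest index)."""
        if not self.free_blocks:
            raise RuntimeError("no free blocks available")
        block = min(self.free_blocks, key=lambda b: (self.erase_counts[b], b))
        self.free_blocks.remove(block)
        return block

    def erase(self, block: int) -> None:
        """Erase a block, consuming one P/E cycle, and return it to the free pool."""
        self.erase_counts[block] += 1
        self.free_blocks.add(block)


leveler = WearLeveler(num_blocks=4)
for _ in range(12):          # 12 program/erase rounds
    block = leveler.allocate()
    leveler.erase(block)
print(leveler.erase_counts)  # [3, 3, 3, 3]
```

Without the erase-count comparison, an allocator that always reused the first free block would put all 12 cycles on block 0.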
Flash-based storage devices differ from traditional disks in how they deal with previously deleted data. In SSDs, a block must be erased before new data can be written to it. Garbage Collection copies in-use data to a new block and then erases the old one.
Flash-based storage devices differ from traditional hard disk drives, where new data typically overwrites old data in place at the same physical location. In SSDs, new data is written to open Flash memory blocks, and the old data associated with the same logical addresses is invalidated. Garbage Collection consolidates new and valid data into contiguous memory locations and then erases invalid data from physical Flash memory to free up memory blocks for new data.
Flash memory cells are grouped into pages, and several pages make up a block. SSDs read and write data as pages, and erase data at the block level. To reuse a block that contains data, the SSD controller must first copy all valid data from it and write that data to empty pages of a different block. It then erases the entire original block, valid and invalid data alike, before new data can be written to the newly erased block. This process is called Garbage Collection.
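The copy-then-erase sequence can be sketched with a toy model of pages and blocks. Everything here is an illustrative assumption: the `Block` class, the `garbage_collect` function, and the four-page block size are hypothetical, and real controllers track page state in mapping tables instead of Python lists.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

PAGES_PER_BLOCK = 4  # hypothetical; real blocks hold far more pages

Page = Optional[Tuple[str, str]]  # None = erased, else ("valid" | "invalid", data)


@dataclass
class Block:
    pages: List[Page] = field(default_factory=lambda: [None] * PAGES_PER_BLOCK)
    erase_count: int = 0

    def program(self, data: str) -> int:
        """Write data to the first erased page; pages cannot be overwritten in place."""
        index = self.pages.index(None)  # raises ValueError if the block is full
        self.pages[index] = ("valid", data)
        return index

    def invalidate(self, index: int) -> None:
        """Mark a programmed page as stale (its logical address now lives elsewhere)."""
        page = self.pages[index]
        if page is None:
            raise ValueError("cannot invalidate an erased page")
        self.pages[index] = ("invalid", page[1])

    def erase(self) -> None:
        """Erase works on the whole block, never on a single page."""
        self.pages = [None] * PAGES_PER_BLOCK
        self.erase_count += 1


def garbage_collect(victim: Block, spare: Block) -> int:
    """Copy valid pages from victim into spare, erase victim, return pages copied."""
    copied = 0
    for page in victim.pages:
        if page is not None and page[0] == "valid":
            spare.program(page[1])
            copied += 1
    victim.erase()
    return copied


victim, spare = Block(), Block()
for data in ["A", "B", "C", "D"]:
    victim.program(data)
victim.invalidate(1)  # "B" was superseded
victim.invalidate(3)  # "D" was superseded
copied = garbage_collect(victim, spare)
print(copied, spare.pages, victim.pages)
```

After the call, the spare block holds only "A" and "C", and the victim block is fully erased and ready for new writes.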
TRIM is a command that lets the operating system tell the SSD which blocks are no longer needed and can be deleted or marked as free for rewriting. The TRIM command not only reduces the Write Amplification Factor (WAF) but also boosts data access speeds.
Flash-based devices are organized physically at the block level, as managed by the controller, while operating systems have their own organizational scheme: the file system. The SSD controller knows which blocks are in use and which are free, but it doesn't know which blocks correspond to which files. Things get complicated when files are deleted.
The TRIM command provides a bridge from the file level to the block level, giving the OS a way to tell the SSD that it is deleting files, so that the data belonging to those files is marked for removal during the next Garbage Collection run.
Without TRIM commands, Garbage Collection doesn't know which files the OS has deleted, so it keeps moving pages containing deleted data along with good pages, which increases write amplification. This is where the TRIM command helps: it tells the SSD controller to stop relocating pages that hold deleted data, so they are left behind and erased with the rest of the block.
TRIM commands can improve the Garbage Collection process, reduce write amplification, extend the life of an SSD, and improve its performance.
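The effect on write amplification can be made concrete with a small worked example. It assumes the usual definition of write amplification as total pages written to Flash divided by pages written by the host. The function names, the logical block addresses, and the eight-page block are all made up for illustration.

```python
from typing import List, Set


def pages_to_relocate(block_lbas: List[int], trimmed_lbas: Set[int]) -> List[int]:
    """Logical addresses Garbage Collection must copy before erasing the block.

    Without TRIM the controller has no way to know a page is stale, so every
    programmed page that was not trimmed is treated as valid.
    """
    return [lba for lba in block_lbas if lba not in trimmed_lbas]


def write_amplification(host_pages: int, relocated_pages: int) -> float:
    """Total Flash page writes divided by host page writes (assumed WAF definition)."""
    return (host_pages + relocated_pages) / host_pages


block = [10, 11, 12, 13, 14, 15, 16, 17]   # hypothetical LBAs stored in one block
deleted_by_os = {12, 13, 14, 15, 16, 17}   # files the OS has since deleted

without_trim = pages_to_relocate(block, set())         # controller never told
with_trim = pages_to_relocate(block, deleted_by_os)    # OS issued TRIM

host_pages = len(block)
waf_without = write_amplification(host_pages, len(without_trim))
waf_with = write_amplification(host_pages, len(with_trim))
print(waf_without, waf_with)  # 2.0 1.25
```

In this example TRIM cuts the relocated pages from eight to two, because the six pages of deleted data are simply erased with the block.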
Over-Provisioning is a technique in which a portion of the physical capacity of the memory is reserved for carrying out garbage collection, wear-leveling and bad block management. It effectively reduces write amplification and extends the lifespan of an SSD.
A certain portion of the SSD's space is reserved for performing various internal memory management functions. The over-provisioned space cannot be used or accessed by users, and is invisible to the host operating system.
SSD manufacturers typically reserve about 7% of storage space as over-provisioning for background activities such as Garbage Collection and Wear-Leveling. Take an SSD with 256GB of raw capacity as an example. It reserves 16GB for built-in over-provisioning, which is roughly 7% when measured against the resulting 240GB of user capacity. A higher over-provisioning percentage (e.g. 28%) can further improve performance and SSD product lifetime.
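The capacity arithmetic can be checked with a short sketch. It assumes the common convention of expressing over-provisioning relative to user capacity, which is the reading under which 256GB raw and 240GB usable come out at roughly 7%. The function names are hypothetical.

```python
def over_provisioning_percent(physical_gb: float, user_gb: float) -> float:
    """Over-provisioning as a percentage of user capacity (assumed convention)."""
    return (physical_gb - user_gb) / user_gb * 100


def user_capacity_gb(physical_gb: float, op_percent: float) -> float:
    """User-visible capacity left after reserving op_percent over-provisioning."""
    return physical_gb / (1 + op_percent / 100)


default_op = over_provisioning_percent(256, 240)
print(round(default_op, 1))  # 6.7, i.e. roughly 7%

# With 28% over-provisioning, the same 256GB of raw Flash exposes about 200GB.
high_op_capacity = user_capacity_gb(256, 28)
print(round(high_op_capacity, 1))
```

The trade-off is visible in the numbers: raising over-provisioning from about 7% to 28% gives up roughly 40GB of user capacity in exchange for more working space for Garbage Collection and Wear-Leveling.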
When Garbage Collection is triggered, it affects the SSD’s write performance to a degree that depends on how scattered the data is. The less the data is scattered across the NAND Flash and the higher the percentage of over-provisioning, the less likely it is that performance will be affected, because Garbage Collection has fewer work cycles to perform. In other words, write amplification is reduced, so the lifespan of the SSD is increased. The following graphics show the endurance improvement at different Over-Provisioning levels.
SMART Modular Technologies helps customers around the world enable high performance computing through the design, development, and advanced packaging of integrated memory solutions. Our portfolio ranges from today’s leading edge memory technologies to standard and legacy DRAM and Flash storage products. For more than three decades, we’ve provided standard, ruggedized, and custom memory and storage solutions that meet the needs of diverse applications in high-growth markets. Contact us today for more information.