
Posted on May 31, 2023

149: GreyBeards talk HPC storage with Dustin Leverman, Group Leader, HPC storage at ORNL

Ran across an article discussing ORION, ORNL’s new storage system, which has 100s of PB of file storage and supports TB/sec of bandwidth, so naturally I thought the GreyBeards needed to talk to these people. I reached out and Dustin Leverman, Group Leader, HPC storage at Oak Ridge National Labs (ORNL), answered the call. Dustin has been in HPC storage for a long time, and at ORNL he has helped deploy Orion, an almost 670PB, multi-tier file storage system for Frontier supercomputer users.

Orion is a LUSTRE file system based on HPE (Cray) ClusterStor, with ~10PB of metadata storage, 11PB of NVMe flash, and 649PB of disk. How the system handles multi-tiering is unique, AFAIK. It performs 11TB/sec of write IO and 14TB/sec of read IO. Note, that’s TeraBytes/sec, not TeraBits. Listen to the podcast to learn more.

https://media.blubrry.com/greybeardsonstorage/greybeardsonstorage.com/wp-content/uploads/Podcasts/2023/05/GBoS-PC-20230519.mp3

Podcast: Play in new window | Download (Duration: 49:13 — 67.6MB) | Embed


While designing Orion, ORNL found their users have a very bi- (tri-?) modal file size distribution. That is, many of their files are under 256KB, a lot are under 8MB, and the rest are all over 8MB. As a result, they added Progressive File Placement to support multi-tiering on LUSTRE.

Orion has 3 tiers of data storage. The 1st tier is a 10PB NVMe SSD metadata tier. Orion also uses Data on Metadata, which stores the 1st 256KB of every file along with the file metadata. So, accessing very small files (<256KB) is all done out of the metadata tier. But what’s interesting is that the first 256KB of every file on ORION is located on the metadata tier.

Orion’s 2nd tier is an 11PB NVMe SSD flash tier. On this tier, they store all file data over 256KB and under 8MB. The NVMe flash tier is not as fast as the metadata tier, but it supports another large chunk of ORNL files.

The final Orion tier is 649PB of spinning disk storage. Here it stores all file data beyond the 8MB point in a file. Yes, it’s slower than the other 2 tiers, but it makes that up in volume. Very large files will find they can predictably access the first 256KB from metadata, the next 8MB (minus 256KB) from flash, and then have to use disk to access any file data after that.

It’s important to note that Orion doesn’t place hot data in the upper tiers and cold data in the lowest tier, as many multi-tier storage systems do. Rather, Orion multi-tiering just places different segments of every file on different tiers, depending on where that data resides within the file.
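To make the segment-based placement concrete, here’s a minimal sketch, in Python, of how a byte offset maps to a tier given the 256KB and 8MB boundaries described above. It’s illustrative only, not ORNL code; the real mechanism is the Data on Metadata and Progressive File Placement features mentioned earlier.

# Minimal sketch of Orion-style segment placement (illustrative only).
# Boundaries from the discussion above: first 256KB of every file on the
# metadata tier, 256KB-8MB on the NVMe flash tier, the rest on disk.

KB = 1024
MB = 1024 * KB

DOM_END = 256 * KB    # end of the Data on Metadata segment
FLASH_END = 8 * MB    # end of the flash segment; data beyond this lives on disk

def tier_for_offset(offset: int) -> str:
    """Return which tier holds the byte at a given file offset."""
    if offset < DOM_END:
        return "metadata (NVMe)"
    if offset < FLASH_END:
        return "flash (NVMe)"
    return "disk"

def tiers_for_file(size: int) -> list[str]:
    """List the tiers a file of the given size spans, in order."""
    segment_starts = [0, DOM_END, FLASH_END]
    return [tier_for_offset(s) for s in segment_starts if s < size]

for size in (100 * KB, 4 * MB, 3000 * MB):
    print(f"{size:>13,} bytes -> {tiers_for_file(size)}")

Note that placement is by offset within the file, not by how hot the data is: a small file never leaves the metadata tier, while a very large file always spans all three tiers.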

In addition to Orion file storage, ORNL also has archive storage that uses HPSS and Spectrum Archive. Dustin mentioned that ORNL’s HPC data archive is accessed more frequently than typical archive storage, so there’s lots of data movement going on between the archive and Orion.

Orion is composed of metadata nodes and object storage targets (OSTs, i.e., storage nodes). Each OST has 1 flash target (made up of many SSDs) and 2 disk targets (made up of many disk drives).

Dustin mentioned that Orion has 450 OSTs, which in aggregate support 5.4K × 3.84TB NVMe SSDs and 47.7K × 18TB disk drives. Doing the math, that’s 20.7PB of NVMe flash and 858.6PB of disk storage.
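A quick back-of-the-envelope check on those drive counts, in Python. These are raw capacities, presumably measured before RAID, spares, and file system overhead, which would explain why they exceed the 11PB and 649PB tier sizes quoted earlier.

# Raw capacity implied by the drive counts quoted above.
nvme_count, nvme_tb = 5_400, 3.84    # 5.4K NVMe SSDs at 3.84TB each
hdd_count, hdd_tb = 47_700, 18       # 47.7K disk drives at 18TB each

print(f"NVMe flash: {nvme_count * nvme_tb / 1000:.1f} PB")  # 20.7 PB
print(f"Disk:       {hdd_count * hdd_tb / 1000:.1f} PB")    # 858.6 PB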

ORION data is protected using ZFS double-parity RAID, which can sustain up to 2 drive failures without losing data. Their stripe has 8 data and 2 parity drives, plus 2 spares.
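For a rough sense of the protection overhead, here’s the arithmetic on that 8 data + 2 parity stripe (my calculation, not a figure from the podcast; ZFS metadata and reserved space reduce usable capacity further).

# Space efficiency of an 8 data + 2 parity stripe, with 2 spares per group.
data_drives, parity_drives, spare_drives = 8, 2, 2
stripe = data_drives + parity_drives

print(f"Parity overhead within a stripe: {parity_drives / stripe:.0%}")  # 20%
print(f"Usable fraction of a stripe:     {data_drives / stripe:.0%}")    # 80%
print(f"Usable fraction counting spares: {data_drives / (stripe + spare_drives):.1%}")  # ~66.7%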

Keith asked how one manages 670PB of LUSTRE storage. Dustin said they have a team of people and many software tools to support it. First and foremost, they take lots of telemetry off of all OSTs and metadata servers to understand what’s going on in the storage cluster. They use SMART data to predict which drives will go bad before they actually fail. He mentioned that, using telemetry, they can tell what kind of performance an app is driving and can use this to tweak which file systems an app uses.
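As a purely hypothetical illustration of that kind of SMART-based screening (the attributes and thresholds below are invented for the example, not anything Dustin described):

from dataclasses import dataclass

@dataclass
class SmartSample:
    drive_id: str
    reallocated_sectors: int
    pending_sectors: int

def flag_suspect_drives(samples):
    # Flag drives whose error counters suggest they may fail soon.
    return [s.drive_id for s in samples
            if s.reallocated_sectors > 50 or s.pending_sectors > 0]

fleet = [
    SmartSample("oss042-slot17", 3, 0),
    SmartSample("oss118-slot05", 120, 4),  # would be flagged for replacement
]
print(flag_suspect_drives(fleet))  # ['oss118-slot05']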

I asked Dustin how he updates a 450 OST + [N] metadata node storage system. They take the cluster down when it needs to be updated. But before that, they regression test any update in their lab and, when ready, roll it out to the whole cluster. Dustin said many problems only show up at scale, which means that an update can only truly be tested when the whole cluster is in operation.

I asked Dustin whether they were doing any AI/ML work at ORNL. He said yes, but it doesn’t run on Orion directly; it uses mirrored DAS NVMe storage on the compute servers. He said that AI/ML workloads don’t require lots of data, and using DAS makes them go as quickly as possible.

Dustin mentioned that ORNL is a DoE-funded lab, so any changes they make to LUSTRE are submitted back to the repository for inclusion in the next release of LUSTRE.

Dustin Leverman, Group Leader, HPC storage at Oak Ridge National Labs

Researcher profile – Dustin Leverman with Frontier’s Orion file system, January 10, 2023.

Dustin Leverman is the Group Leader for HPC Storage and Archive Group of the National Center for Computational Sciences (NCCS) at Oak Ridge National Laboratory (ORNL). The NCCS is home to Orion, the 700 petabyte file system that supports Frontier, the world’s first exascale supercomputing system and fastest computer in the world.

Dustin began his career at ORNL in 2009. He was previously a team leader in the HPC and Data Operations Group. In his current role, Dustin oversees procurement, administration, and support of high-speed parallel file systems and archive capabilities to enable the National Center for Computational Sciences’ overall mission of leadership-class and scalable computing programs.
