Cloud Collaboration Center
To comply with my non-disclosure agreement, this case study is only an overview of my learnings and contributions on the project and does not go deep into project specifics.
How can we monitor the health of Azure's
infrastructure to proactively identify incidents?
The Cloud Collaboration Center (CCC) is a physical facility located in Microsoft's headquarters in Redmond. With a 180-degree view and 1,600-square-foot video walls, it lets our site reliability engineers troubleshoot issues in real time so that Azure runs efficiently and our customers have a reliable experience.
The main wall of the Cloud Collaboration Center. Credit: Microsoft
My Role
In 2019, our design team inherited this project. I was assigned to it so I could bring my learnings about internal tools from Microsoft's Incident Management ecosystem to better integrate the CCC with our monitoring portfolio.
Initially, we identified three main investment areas:
The mapping of a dedicated design system so we could have a more cohesive experience catered to the uniqueness of the space.
More strategic integrations with other products so the CCC facility served as an extension of the existing monitoring and reactive tools at the company.
A way to scale as new infrastructure was being created in Azure. That year Azure was preparing the satellite launch for Azure Space in partnership with SpaceX. The facility also wanted to track the subsea Marea cable, and new datacenters were emerging. There was a lot of momentum, and the facility needed to scale fast to monitor a fast-growing infrastructure.
What I did
Defined design principles
Interaction design
Product vision + Scale strategy
Integrations across tools
Data visualization models
Physical + digital accessibility audit
Fostering relationships with partners
Managing UX and visual designers
Merging physical and digital tools
How do Service Reliability Engineers use the facility?
Service Reliability Engineers at Microsoft are our primary users. Their role is to monitor, identify and proactively resolve anything that compromises the health of the Azure cloud and might impact our customers. They sit in the room and look for proactive and reactive opportunities to keep Azure services up and running.
The Cloud Collaboration Center has three video walls that organize their data in different themes:
Left wall, infrastructure health: Here there is an overview of Datacenters, network traffic, Satellites, Subsea Cables, Virtual Machines and other infrastructure’s health.
Front wall, Regional health: This wall displays health at a regional level. It shows a world map that highlights different continents in turn, with a comprehensive view of regions and updates from geo-distributed engineering teams around the globe to scale cloud operations.
Right wall, Customer Insights: This wall includes support cases, tweets, escalations, and High-Priority events to help understand data based on customer insights.
Challenge themes to address
We identified two unique challenge themes that kept us busy finding solutions:
CHALLENGE THEME
Limited Real Estate: When 380 million pixels are not enough
The large video walls used to show data at a planetary scale might seem like a lot, but when the premise is to display a comprehensive view of the health of Azure and its ever-evolving, fast-paced scale, it's extremely easy to run out of space. This posed a complex design problem for the UX team: How do we best utilize our space and ensure we can scale at the same rate as Azure? We had to rethink how we design for scale and large form factors.
Partnering with the accessibility team
We needed some help from the pros to tackle visual, cognitive and motion barriers in the facility. We partnered with the C+AI Inclusive and Accessible Team at Microsoft to run an audit and understand what needed to change to make the CCC more accessible. With them, we conducted an on-site research study to determine areas of opportunity, and participated in several Data Visualization trainings offered by the team. This helped us rethink our design patterns and aim for more accessible practices.
PROBLEM ONE
Am I including information my users don't need to consume?
When working with a limited amount of space, it is imperative to aggressively prioritize the problem we are trying to solve, as sometimes less is more. Because of the large scale of the facility, it was easy for us to get carried away and want to display all the data available, causing cognitive overload and defeating the purpose of a command center.
From words to patterns
The UX and product team decided to use the CCC walls exclusively for pattern recognition and trend spotting, rather than for reading or exploring data. This clarity allowed us to reimagine the walls and remove data that was steering us away from the goal: helping users triangulate and correlate issues. However, pattern recognition alone is not enough to keep Azure healthy, so alongside the room we saw the need to invest in a drill-down web app that lets our users dig into an anomaly spotted on the walls and troubleshoot fast, complementing the room's functionality without distracting the user.
At the facility, Reliability engineers triangulate data across multiple screens to recognize trends and patterns.
PROBLEM TWO
Am I conveying the information in the right way?
Imagine a large room full of screens. Standard rules for font size and readability quickly became ineffective for us because our users could be sitting in different areas of the space or, even worse, moving around!
Building on the first learning and focusing on the goal of pattern recognition, we realized that data visualizations would take priority over labels, text or other written components of the walls. Due to the physical constraints of the room and the high value of immediate and proactive trend-spotting, we had to tell a story without words. Data visualizations allowed for more data density and faster identification of anomalies.
Sections on the walls have a dedicated header with a repeating structure, an area for a summary, and visualizations with legends.
PROBLEM THREE
Can my information be grouped more clearly?
Information architecture can play a big role in effectively organizing, labelling and grouping data. To avoid overwhelming the user with complex information on the walls, we identified logical groupings and organized the screens accordingly. This allowed us to tell a cohesive story and to help the user understand where to find a specific key trend. For groupings to be effective we used consistent headers, subheaders and legends. Within the groups, we also opted to display the potentially compromised areas first, to help the user sort the information by importance. This strategy created a path for the viewer's eye to follow and scan for information across multiple sections of the facility.
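The grouping and ordering rule described above can be sketched in a few lines of Python. This is only an illustration, not the CCC's actual implementation: the Tile type, its fields, and the sample names are hypothetical and were chosen just to show "group first, then most compromised first".

```python
from dataclasses import dataclass


@dataclass
class Tile:
    group: str      # hypothetical grouping label, e.g. "Datacenters"
    name: str       # hypothetical resource name
    anomalies: int  # hypothetical count of open anomalies


def order_for_wall(tiles):
    """Group tiles, then list the most compromised ones first in each group."""
    grouped = {}
    for tile in tiles:
        grouped.setdefault(tile.group, []).append(tile)
    # Sort by anomaly count (descending), then by name for a stable layout.
    return {
        group: sorted(members, key=lambda t: (-t.anomalies, t.name))
        for group, members in grouped.items()
    }


tiles = [
    Tile("Datacenters", "dc-a", 0),
    Tile("Datacenters", "dc-b", 3),
    Tile("Subsea cables", "cable-x", 1),
    Tile("Datacenters", "dc-c", 1),
]
layout = order_for_wall(tiles)
```

Breaking ties by name keeps tiles from jumping around between refreshes, which matters when viewers rely on position to find a trend.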
PROBLEM FOUR
Can motion help me highlight or scale my data?
Sometimes, too much data is simply too much data to fit on the screen. Because we were designing a product that cannot be interacted with, we had to get creative to fit more information in less space. We started refreshing screens the way flight and train information boards do, rotating periodically to display a portion of the data at a time. We also started to leverage subtle animations to signal a change in a trend or pattern to the engineers.
Airport boards refresh information at a predictable rate so they can scale within limited real estate.
We quickly realized that motion could be a double-edged sword: used right, it could allow us to scale and enrich our data; used wrong, it could cause cognitive overload and physical discomfort. So we partnered with the C+AI Accessibility team to better understand the principles of motion design and the accessibility concerns that arise when a user cannot pause the rotation on the screen. As a general rule of thumb, we learned that users with disabilities need content to be shown for double the amount of time that users without disabilities need. After exploring different refresh intervals, we made an informed guess about the number of minutes per transition in the loops.
Motion also helped us to add subtle animations to changes and to allow the user to quickly pivot and spot differences.
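As a rough sketch of the board-style rotation and the doubling rule of thumb, the snippet below splits rows into pages and schedules each page with a doubled dwell time. The function names, the 30-second base value and the page size are assumptions made for illustration; the real intervals used in the facility are not stated here.

```python
def paginate(rows, page_size):
    """Split rows into fixed-size pages, like a departure board."""
    if page_size < 1:
        raise ValueError("page_size must be at least 1")
    return [rows[i:i + page_size] for i in range(0, len(rows), page_size)]


def dwell_seconds(base_seconds, accessibility_factor=2.0):
    """Apply the rule of thumb: show content for double the baseline time."""
    return base_seconds * accessibility_factor


def rotation_schedule(rows, page_size, base_seconds):
    """Return (start_time_in_seconds, page) pairs for one full loop."""
    dwell = dwell_seconds(base_seconds)
    pages = paginate(rows, page_size)
    return [(index * dwell, page) for index, page in enumerate(pages)]


# Hypothetical example: 7 rows, 3 rows per page, 30-second baseline.
schedule = rotation_schedule(list(range(1, 8)), page_size=3, base_seconds=30)
```

Because the viewer cannot pause the loop, a fixed and predictable dwell time per page is the main lever; the factor is kept as a parameter so it can be tuned after testing.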
CHALLENGE THEME
Cognitive Overload: How to focus when you are surrounded by data
As the walls got busier, our users' attention became more compromised. In a command center where time matters, our goal as the UX team was to support the user in recognizing issues and taking action as fast as possible. Once again we went back to basic design principles and explored ways to reduce the cognitive load of the walls.
PROBLEM ONE
What is my primary and secondary information?
Health is binary, and because of that, we instinctively wanted to show both sides: the healthy and the unhealthy instances. However, after some reflection we realized that healthy states are secondary to the purpose of the command center. We helped engineers visually isolate areas of interest by making healthy states secondary and unhealthy events the primary focus of attention.
PROBLEM TWO
Am I using color the right way?
Given the duality of conveying health, using red and green was extremely enticing. However, once we learned that healthy states were secondary information for our users, we decided that black was going to be our new green. To avoid color overload and a very taxing experience for the eyes in a dark room, we opted for a primary dark theme with a monochromatic scale as our baseline, with purple as an accent color to depict anomalies. This allowed us to declutter the screens of yellows, oranges and greens, and to direct the user's focus exclusively to pattern recognition, which does not necessarily need an alarming and distracting red. This palette was soothing to the eye, promoted collaboration and guided the user's attention to important content.
Purple in different tones conveys anomalies without color-loading the visualizations.
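One way to picture this palette rule is a small mapping from a severity level to a tone, where healthy states fall back to the dark baseline and anomalies step through purple. The hex values and the three-level scale below are purely hypothetical placeholders, not the colors used in the facility.

```python
# Hypothetical values for illustration only.
HEALTHY_TONE = "#1a1a1a"                          # dark baseline, "the new green"
PURPLE_TONES = ["#5b3f8c", "#8a5fd1", "#c9a6ff"]  # dim to bright


def tone_for(severity):
    """Map severity (0 = healthy, 1+ = increasingly anomalous) to a tone."""
    if severity <= 0:
        return HEALTHY_TONE
    # Clamp so anything above the top level uses the brightest purple.
    index = min(severity, len(PURPLE_TONES)) - 1
    return PURPLE_TONES[index]
```

With a single accent hue, brightness alone carries severity, so healthy areas recede into the background without any extra styling.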
PROBLEM THREE
How am I helping the user to see the big picture?
In the facility, dynamic card summaries of regions are an example of summarization.
Sometimes it is necessary to invest some real estate in strategic data that summarizes the whole and allows for quick insights. We opted to dedicate some space to summary sections, which minimize the time needed to investigate issues and spot trends, reduce cognitive load, and increase our users' efficiency.
Aside from dedicated summary sections, we learned that it is often worth including total numbers in titles, subtitles, table headers, etc. Sometimes a quick glance at the overall picture helps users get their bearings and digest the information faster.
Closing thoughts
Designing a command center is a complex and unique task. There are limitations, challenges and opportunities that do not exist in traditional web design. Moving forward, we are looking at ways to integrate AI in the facility to better assist with predictive data and analytics, so our approach goes from reactive to proactive. We are also merging the physical experience with a digital one that will enable deep-dives and drill-downs from the data on the screen. I am sure the CCC will play a more central role across monitoring and reactive experience tools at Microsoft, and I look forward to seeing what the future holds for the facility!