EX172491 – Can’t access email: the Microsoft Office 365 apocalypse (Updated)

Gioxx, 25/01/2019

That is the title that appeared around 10:20 (Italian time) yesterday in the Office 365 admin area (to be pedantic, 2019-01-24 09:17 UTC). A few minutes before the incident was posted, the first-level support team had started receiving reports from users who could no longer reach Outlook or, more generally, their Exchange mail in the cloud (so OWA was affected too). We barely had time to open a ticket with Microsoft before the first update appeared in the service's admin area, and it would be the first of many: “We’re investigating a potential issue and checking for impact to your organization. We’ll provide an update within 60 minutes.”
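Since this post mixes Italian local time with the UTC timestamps shown by Microsoft, here is a minimal Python sketch for converting between the two. It assumes Italy is on CET (UTC+1), which holds in late January; the function name `utc_to_italian` is just an illustrative choice, not something taken from any Microsoft tool.

```python
from datetime import datetime, timedelta, timezone

# Assumption: Italy is on CET (UTC+1) in late January, so a fixed offset is enough.
# For dates that may fall in summer time, a real time zone database would be needed.
CET = timezone(timedelta(hours=1), name="CET")


def utc_to_italian(stamp):
    """Parse a 'YYYY-MM-DD HH:MM' timestamp expressed in UTC and return it in CET."""
    utc_time = datetime.strptime(stamp, "%Y-%m-%d %H:%M").replace(tzinfo=timezone.utc)
    return utc_time.astimezone(CET)


local = utc_to_italian("2019-01-24 09:17")
print(local.strftime("%Y-%m-%d %H:%M %Z"))  # 2019-01-24 10:17 CET
```

The result, 10:17 CET, matches the "around 10:20" reported above.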

EX172491 - Can't access email

They call it a “service degradation”; in practice it means “total panic and a scramble for cover”. Whatever people keep shouting, email is and remains the center of the universe for anyone doing office work, and it reaches into other areas that rely on it for automated and manual tasks alike. I'm thinking of a large group like the one I work for, though plenty of other well-known names took the hit full in the face today, and in organizations like these even the most manual job is at least grazed by the blow. A head-on crash like this shows how a cloud infrastructure, however well maintained and closely watched, can turn out to be a huge point of failure, one you are completely powerless against regardless of the technical skills you have and could bring to bear.

Microsoft Office 365 is by now a very widely used service, as shown by the volume of discussion on social networks, forums and specialist blogs with their comment sections. The wave of reports also reached tools that have been gaining ground lately, such as Outage.report (the Office 365 page is at outage.report/office-365) and DownDetector (downdetector.it/problemi/office-365).

Day 1

After a full day in the dark, the first emails started reaching Outlook and Mail (the native iOS app) around 21:00 Italian time. It was little, far too little: the disruption persisted despite a fix announced and deployed by Microsoft. Here are the status updates up to the first step toward the light (most recent first):

1/24/2019 10:53:52 PM
  • Title: Can’t access email
  • User Impact: Users are unable to connect to the Exchange Online service via multiple protocols.
  • More info: As a result of this issue, users are receiving an error message indicating the number of concurrent connections has exceeded a limit when they attempt to send and receive email.
  • Current status: We’re continuing to monitor the environment to ensure that service is recovering and email is being delivered as expected.
  • Scope of impact: Impact is specific to users who are served through the affected infrastructure.
  • Start time: Thursday, January 24, 2019, at 9:00 AM UTC
  • Preliminary root cause: A subset of our managed Domain Controllers are in a degraded state, affecting Exchange Online functionality.
  • Next update by: Thursday, January 24, 2019, at 11:00 PM UTC
1/24/2019 9:55:54 PM
  • Title: Can’t access email
  • User Impact: Users are unable to connect to the Exchange Online service via multiple protocols.
  • More info: As a result of this issue, users are receiving an error message indicating the number of concurrent connections has exceeded a limit when they attempt to send and receive email.
  • Current status: We’re continuing to experience connectivity improvements and observe that email is beginning to deliver as expected. In parallel, we’re making additional changes as needed and are monitoring our email queues to ensure they continue to process.
  • Scope of impact: Impact is specific to users who are served through the affected infrastructure.
  • Start time: Thursday, January 24, 2019, at 9:00 AM UTC
  • Preliminary root cause: A subset of our managed Domain Controllers are in a degraded state, affecting Exchange Online functionality.
  • Next update by: Thursday, January 24, 2019, at 10:00 PM UTC
1/24/2019 8:53:18 PM
  • Title: Can’t access email
  • User Impact: Users are unable to connect to the Exchange Online service via multiple protocols.
  • More info: As a result of this issue, users are receiving an error message indicating the number of concurrent connections has exceeded a limit when they attempt to send and receive email.
  • Current status: We’ve observed an increase in connectivity after deploying our solution in the affected environment. We’re implementing additional changes in an effort to provide continued relief.
  • Scope of impact: Impact is specific to users who are served through the affected infrastructure.
  • Start time: Thursday, January 24, 2019, at 9:00 AM UTC
  • Preliminary root cause: A subset of our managed Domain Controllers are in a degraded state, affecting Exchange Online functionality.
  • Next update by: Thursday, January 24, 2019, at 9:00 PM UTC
1/24/2019 7:55:12 PM
  • Title: Can’t access email
  • User Impact: Users are unable to connect to the Exchange Online service via multiple protocols.
  • More info: As a result of this issue, users are receiving an error message indicating the number of concurrent connections has exceeded a limit when they attempt to send and receive email.
  • Current status: We’ve identified a potential fix to address this issue and are testing the fix to confirm that it is a viable solution.
  • Scope of impact: Impact is specific to users who are served through the affected infrastructure.
  • Start time: Thursday, January 24, 2019, at 9:00 AM UTC
  • Preliminary root cause: A subset of our managed Domain Controllers are in a degraded state, affecting Exchange Online functionality.
  • Next update by: Thursday, January 24, 2019, at 8:00 PM UTC
1/24/2019 6:00:42 PM
  • Title: Can’t access email
  • User Impact: Users are unable to connect to the Exchange Online service via multiple protocols.
  • More info: As a result of this issue, users are receiving an error message indicating the number of concurrent connections has exceeded a limit when they attempt to send and receive email.
  • Current status: Our efforts to restore connectivity to the affected domain controllers continues. In parallel, we’re analyzing data to identify alternative means to restore service and better understand the underlying source of this problem.
  • Scope of impact: Impact is specific to users who are served through the affected infrastructure.
  • Start time: Thursday, January 24, 2019, at 9:00 AM UTC
  • Preliminary root cause: A subset of our managed Domain Controllers are in a degraded state, affecting Exchange Online functionality.
  • Next update by: Thursday, January 24, 2019, at 7:00 PM UTC
1/24/2019 4:08:31 PM
  • Title: Can’t access email
  • User Impact: Users may be unable to connect to the Exchange Online service.
  • More info: As a result of this issue, users will be experiencing issues when they attempt to send and receive email.
  • Current status: We’re continuing to fix the unhealthy Domain Controllers while actively monitoring the connections to the healthy infrastructure. Additionally, we’re reviewing system logs from the unhealthy Domain Controllers to understand the underlying cause of the issue.
  • Scope of impact: Impact is specific to users who are served through the affected infrastructure.
  • Start time: Thursday, January 24, 2019, at 9:00 AM UTC
  • Next update by: Thursday, January 24, 2019, at 5:00 PM UTC
1/24/2019 1:57:40 PM
  • Title: Can’t access email
  • User Impact: Users may be unable to connect to the Exchange Online service.
  • Current status: We’ve determined that a subset of Domain Controller infrastructure is unresponsive, which is resulting in user connection time outs. We’re optimising connectivity to the healthy infrastructure while fixing the unhealthy Domain Controllers.
  • Scope of impact: Impact is specific to users who are served through the affected infrastructure.
  • Start time: Thursday, January 24, 2019, at 9:00 AM UTC
  • Next update by: Thursday, January 24, 2019, at 3:00 PM UTC
1/24/2019 12:09:46 PM
  • Title: Can’t access email
  • User Impact: Users may be unable to connect to the Exchange Online service.
  • Current status: We’ve identified that a networking issue within the Exchange Online infrastructure may be causing impact. We’re looking into connectivity logs to determine the underlying cause and remediate impact.
  • Scope of impact: Impact is specific to users who are served through the affected infrastructure.
  • Next update by: Thursday, January 24, 2019, at 1:00 PM UTC
1/24/2019 10:54:10 AM
  • Title: Can’t access email
  • User Impact: Users may be unable to connect to the Exchange Online service.
  • Current status: We’ve determined that a subset of mailbox database infrastructure became degraded, causing impact. We’re identifying the next troubleshooting steps to remediate impact.
  • Scope of impact: Impact is specific to users who are served through the affected infrastructure.
  • Next update by: Thursday, January 24, 2019, at 11:00 AM UTC
1/24/2019 10:18:04 AM

We’re investigating a potential issue and checking for impact to your organization. We’ll provide an update within 60 minutes.
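The updates above follow a regular shape: a timestamp line, then a series of "• Field: value" lines. If you want to archive or compare them, a small parser can turn the pasted text into structured records. What follows is only a sketch under my own assumptions: the function name `parse_updates` and the record layout are hypothetical, and the timestamps are kept without a time zone because the dashboard text does not state one.

```python
import re
from datetime import datetime

# Timestamp lines look like "1/24/2019 10:53:52 PM" (US-style date, 12-hour clock).
TIMESTAMP = re.compile(r"^\d{1,2}/\d{1,2}/\d{4} \d{1,2}:\d{2}:\d{2} (?:AM|PM)$")
# Field lines look like "• Current status: ..." once leading spaces are stripped.
FIELD = re.compile(r"^•\s*([^:]+):\s*(.*)$")


def parse_updates(text):
    """Split a pasted status log into a list of dicts, one per update."""
    updates = []
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if TIMESTAMP.match(line):
            # A new timestamp opens a new record.
            current = {
                "posted": datetime.strptime(line, "%m/%d/%Y %I:%M:%S %p"),
                "fields": {},
                "notes": [],
            }
            updates.append(current)
        elif current is not None:
            match = FIELD.match(line)
            if match:
                current["fields"][match.group(1).strip()] = match.group(2).strip()
            else:
                # Free text, such as the very first "We’re investigating" notice.
                current["notes"].append(line)
    return updates


sample = """
1/24/2019 12:09:46 PM
  • Title: Can’t access email
  • Next update by: Thursday, January 24, 2019, at 1:00 PM UTC
1/24/2019 10:18:04 AM

We’re investigating a potential issue and checking for impact to your organization.
"""

updates = parse_updates(sample)
for update in updates:
    print(update["posted"], sorted(update["fields"]), len(update["notes"]))
```

Only the first colon on each line is treated as the separator, so values such as "at 1:00 PM UTC" stay intact.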

Day 2

Albeit with difficulty, emails kept arriving in mailboxes and therefore in the configured, connected clients. Everything blew up again during the Italian morning, with an evident overload of the available resources. There was perhaps an hour of total blackout; the rest of the time the connection was more or less stable but email delivery was heavily delayed. From about 14:00 (Italian time) the situation appears to have become markedly more stable. Microsoft's status updates follow:

1/25/2019 3:04:30 PM
  • Title: Can’t access email
  • User Impact: Users are unable to connect to the Exchange Online service via multiple protocols.
  • More info: As a result of this issue, users are receiving an error message indicating the number of concurrent connections has exceeded a limit when they attempt to send and receive email.
  • Current status: We’ve determined that excessive load is causing queues within the authentication infrastructure. We’re formulating a plan to remediate impact.
  • Scope of impact: Impact is specific to users who are served through the affected infrastructure.
  • Start time: Thursday, January 24, 2019, at 9:00 AM UTC
  • Preliminary root cause: A subset of our managed Domain Controllers were in a degraded state, affecting Exchange Online functionality.
  • Next update by: Friday, January 25, 2019, at 6:00 PM UTC
1/25/2019 1:01:46 PM
  • Title: Can’t access email
  • User Impact: Users are unable to connect to the Exchange Online service via multiple protocols.
  • More info: As a result of this issue, users are receiving an error message indicating the number of concurrent connections has exceeded a limit when they attempt to send and receive email.
  • Current status: We’ve determined that there is higher than expected queuing within the authentication infrastructure, which may be the cause of impact. We’re working to identify the cause of these queues and determine steps to remediate impact.
  • Scope of impact: Impact is specific to users who are served through the affected infrastructure.
  • Start time: Thursday, January 24, 2019, at 9:00 AM UTC
  • Preliminary root cause: Previously, a subset of our managed Domain Controllers were in a degraded state, affecting Exchange Online functionality. The current cause of impact is still under investigation.
  • Next update by: Friday, January 25, 2019, at 2:00 PM UTC
1/25/2019 12:00:40 PM
  • Title: Can’t access email
  • User Impact: Users are unable to connect to the Exchange Online service via multiple protocols.
  • More info: As a result of this issue, users are receiving an error message indicating the number of concurrent connections has exceeded a limit when they attempt to send and receive email.
  • Current status: We’re gathering detailed forensics within the affected infrastructure to isolate the cause of the connection time outs and identify steps to remediate impact.
  • Scope of impact: Impact is specific to users who are served through the affected infrastructure.
  • Start time: Thursday, January 24, 2019, at 9:00 AM UTC
  • Preliminary root cause: A subset of our managed Domain Controllers are in a degraded state, affecting Exchange Online functionality.
  • Next update by: Friday, January 25, 2019, at 12:00 PM UTC
1/25/2019 10:46:32 AM
  • Title: Can’t access email
  • User Impact: Users are unable to connect to the Exchange Online service via multiple protocols.
  • More info: As a result of this issue, users are receiving an error message indicating the number of concurrent connections has exceeded a limit when they attempt to send and receive email.
  • Current status: Our telemetry data is indicating connection time outs within the Exchange authentication infrastructure, resulting in impact to the service. We’re looking into the Domain Controller logs to understand the cause and remediate impact.
  • Scope of impact: Impact is specific to users who are served through the affected infrastructure.
  • Start time: Thursday, January 24, 2019, at 9:00 AM UTC
  • Preliminary root cause: A subset of our managed Domain Controllers were in a degraded state, affecting Exchange Online functionality.
  • Next update by: Friday, January 25, 2019, at 11:00 AM UTC
1/25/2019 9:48:24 AM
  • Title: Can’t access email
  • User Impact: Users are unable to connect to the Exchange Online service via multiple protocols.
  • More info: As a result of this issue, users were receiving an error message indicating the number of concurrent connections has exceeded a limit when they attempt to send and receive email.
  • Current status: We’ve confirmed that the Exchange Online service is healthy and is operating within normal tolerances. We’ll continue to monitor the infrastructure to ensure that the service remains healthy.
  • Scope of impact: Impact is specific to users who are served through the affected infrastructure.
  • Start time: Thursday, January 24, 2019, at 9:00 AM UTC
  • Preliminary root cause: A subset of our managed Domain Controllers were in a degraded state, affecting Exchange Online functionality.
  • Next update by: Friday, January 25, 2019, at 5:00 PM UTC
1/25/2019 1:35:09 AM
  • Title: Can’t access email
  • User Impact: Users are unable to connect to the Exchange Online service via multiple protocols.
  • More info: As a result of this issue, users are receiving an error message indicating the number of concurrent connections has exceeded a limit when they attempt to send and receive email.
  • Current status: We’ve confirmed that the queued email has been successfully processed. We will continue monitoring the service throughout the business day to ensure the service continues to operate normally.
  • Scope of impact: Impact is specific to users who are served through the affected infrastructure.
  • Start time: Thursday, January 24, 2019, at 9:00 AM UTC
  • Preliminary root cause: A subset of our managed Domain Controllers are in a degraded state, affecting Exchange Online functionality.
  • Next update by: Friday, January 25, 2019, at 9:00 AM UTC
1/25/2019 12:03:11 AM
  • Title: Can’t access email
  • User Impact: Users are unable to connect to the Exchange Online service via multiple protocols.
  • More info: As a result of this issue, users are receiving an error message indicating the number of concurrent connections has exceeded a limit when they attempt to send and receive email.
  • Current status: We’re observing continued successful delivery of queued email and are closely monitoring the service health. We will continue monitoring the service through the business day to ensure the service continues to operate normally.
  • Scope of impact: Impact is specific to users who are served through the affected infrastructure.
  • Start time: Thursday, January 24, 2019, at 9:00 AM UTC
  • Preliminary root cause: A subset of our managed Domain Controllers are in a degraded state, affecting Exchange Online functionality.
  • Next update by: Friday, January 25, 2019, at 12:30 AM UTC

Update, 28/01/2019

The move to EX172564

Over the weekend Microsoft published further developments, and during the night between 25 and 26 January it “closed” the EX172491 chapter and moved to the new EX172564, a service incident covering a Europe still wounded by the fault that had hit it a few days earlier. Around 2 a.m. on 26/1 this note was posted in the admin dashboard:

This is a continuation of EX172491. We’re targeting this communication specifically to customers who have experienced more significant impact in an effort to provide more detail, this communication will replace EX172491 on your dashboard. We understand that our initial analysis of this incident did not accurately capture the full scope of impact you have experienced throughout the duration of the incident.

Through our initial investigation, we identified that some Domain Controllers (DC) in the environment had become unresponsive. We took actions to restore service to the affected DC’s and implemented a secondary fix to restore service. After completing those actions, we received reports that users were able to access the Exchange Online service and that users were beginning to receive their messages that had been sent during the Exchange Online outage.

We want to ensure that you are receiving the most accurate updates related to your impact and we’re committed to keeping this as our highest priority until the root cause has been fully understood. We apologize that the user impact on our previous Service Health Dashboard post did not correctly convey the impact that your users are experiencing.

Scope of impact: Impact is specific to users located in Europe that are served through the affected infrastructure.

Further updates followed in the early morning. By about 10:00 (Italian time) on Saturday, Microsoft was preparing new DCs able to serve Europe and handle every request coming from users' clients, followed by further steps until the rollout of a more definitive fix was complete:

The deployment of the additional Domain Controllers (DC) is currently at approximately 12.5 percent. We’ve implemented the configuration change to a portion of the affected infrastructure and will monitor the environment to ensure that the connection time-outs have reduced. We’ve identified additional mitigation actions and enabled them to help prevent this issue in the future. We’re continuing in our efforts to enable additional logging.

2019-01-27 04:12 (UTC): The process in which we are adding additional domain controllers to the environment requires that the domain controllers are deployed in batches. Our third batch is still being deployed and is progressing as expected; however, this means that our deployment status remains at 50 percent.

2019-01-27 08:46 (UTC): The third batch of domain controllers is continuing to deploy as expected and we’re monitoring its progress. We’ll begin deployment of the fourth batch of domain controllers once the third batch has completed. As deployment progresses users will begin to see remediation.

2019-01-27 13:01 (UTC): We’ve completed 58 percent of the deployment of domain controllers and the third phase is progressing as anticipated. The fourth phase of deployment of domain controllers is expected to begin in approximately six hours.

2019-01-28 04:45 (UTC): We’ve completed our deployment of domain controllers and we’re performing our final validation tests to ensure all systems are functioning as expected. Additionally, the configuration change to reduce the time-outs has been applied throughout the affected infrastructure. Our current data and testing indicates that the service is maintaining optimal levels and we’ll closely monitor any changes in load or performance to prevent any additional impact.

At the time of writing, the situation is stable and under constant monitoring (2019-01-28 08:48 UTC):

We’re continuing to monitor the service now that the configuration changes have propagated throughout the environment. We’ll continue to monitor the service throughout the working day to ensure that the improvement work we’ve done has remediated impact.

I will keep updating this article as further developments emerge. For now I'm signing off, hoping the weekend turns out better than this huge disruption, the heaviest in the history of the service.
