UrbanPiper – Status
Operational
Updated: IST
Components
Atlas Portal – Operational
Hub
Prime
Meraki
POS Integration
Auth Service – Operational
Maintenance
No upcoming or ongoing maintenance reported
Issue History
No issues reported.
No issues reported.
No issues reported.
Not receiving orders on Prime/POS
Investigating

We have identified a performance issue in our Prime service, which has led to order cancellations. Our team is actively investigating the problem to restore normal functionality. We apologize for any inconvenience and appreciate your patience.

IST
Monitoring

The team has deployed a fix, and all systems are now operational and performing optimally. We will continue to monitor the situation closely.

IST
Resolved

The issue has been resolved, and the team will continue to monitor the situation.


RCA

At 2230 hrs we had a regular, scheduled deployment of changes to our core platform services. This was a planned deployment to enable the release of some important features, one of which was a fix for our data services, which keep our transactional and analytical data in sync.

Post deployment, all systems and infra checks passed, but we were then alerted to an issue where merchants were reporting problems with orders being relayed to their stores. Our first instinct was to make sure we weren't missing any infra-related issue. Soon enough, however, we received clearer inputs pointing to the date-time value associated with orders.

From there, we were quickly able to identify the source of the issue in the latest release and initiated a rollback. The rollback took longer than usual (~25-30 mins), which the team wasn't prepared for: the need for a rollback hadn't arisen in the last 2-3 years, so the inherent delays in the process had never been examined. Once the rollback completed, the issue was resolved and things stabilised quickly.
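The RCA above ties the fault to the date-time value on orders but does not describe the exact defect. Purely as an illustration, the sketch below (in Python, with a hypothetical created_at field) shows the kind of timestamp normalisation guard that makes such a regression fail fast instead of relaying orders with shifted times.

    from datetime import datetime, timedelta, timezone

    def normalise_order_timestamp(created_at: datetime) -> datetime:
        """Return an order timestamp as timezone-aware UTC.

        Naive datetimes are rejected rather than silently assumed to be
        local time, so a release that starts emitting naive values fails
        fast instead of relaying orders with shifted date-time values.
        """
        if created_at.tzinfo is None:
            raise ValueError("order timestamp must be timezone-aware")
        return created_at.astimezone(timezone.utc)

    # Example: a 2230 hrs IST timestamp stored and compared in UTC.
    IST = timezone(timedelta(hours=5, minutes=30))
    print(normalise_order_timestamp(datetime(2024, 1, 1, 22, 30, tzinfo=IST)))
    # -> 2024-01-01 17:00:00+00:00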

IST
Increased latency
Investigating

We have identified a performance degradation in one of our databases, resulting in higher latency in our application. Our team is investigating the issue to restore normal performance. We apologize for any inconvenience and appreciate your patience.

IST
Monitoring

The latency has decreased, and all systems are operational and performing optimally. The team continues to monitor the issue.

IST
Resolved

The issue has been resolved, and the team will continue to monitor the situation.

IST
No issues reported.
No issues reported.
No issues reported.
No issues reported.
No issues reported.
No issues reported.
No issues reported.
No issues reported.
No issues reported.
Atlas Login Issue
Investigating

Several users are observing errors while accessing the Atlas application. The team is reviewing the issue.

IST
Resolved

The issue has been resolved and the application is accessible. The team will share an RCA with all stakeholders.

IST
Hub - Increased latency
Investigating

We are presently noticing heightened latency from certain application servers, which is impacting workflows related to orders. Our team is actively investigating this issue.

IST
Monitoring

We are witnessing a recovery in services, and performance is now optimal. The team is closely monitoring the situation.

IST
Resolved

The issue has been resolved, and the team will provide an RCA to all stakeholders. We apologize for any inconvenience caused.
RCA

Yesterday at 1934 hrs, our systems identified an anomaly in the latency times reported for a few API clusters. While all indicators on the shared infra (databases, caches, queues) were healthy, we could see that some endpoints had started degrading in performance: where we usually have response times under 300 ms, we were seeing them creep up to 2000-3000 ms (2-3 secs).

As we couldn't find any clear indicator of the cause, we also reached out to the AWS Support team for help in identifying any underlying infra issue.

By 2030 hrs, we could see that the degraded performance had started affecting all our clusters, and that the failure rate on some endpoints had gone up. We were able to keep things in check by horizontally scaling our API clusters to 3x their planned capacity.
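The RCA does not say which orchestration layer backs the API clusters; purely as an illustration of that kind of emergency scale-out, a minimal sketch assuming an ECS service on AWS (the cluster and service names are hypothetical):

    import boto3

    def scale_api_service(cluster: str, service: str, factor: int = 3) -> int:
        """Scale an ECS service to `factor` times its current desired count."""
        ecs = boto3.client("ecs")
        current = ecs.describe_services(cluster=cluster, services=[service])
        desired = current["services"][0]["desiredCount"] * factor
        ecs.update_service(cluster=cluster, service=service, desiredCount=desired)
        return desired

    # e.g. scale_api_service("hub-api-cluster", "hub-api", factor=3)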

Around 2100 hrs, while checking secondary metrics on our infra, we spotted that swap usage on the primary database had crept up considerably. As we had never faced issues with swap usage in the past, there were no active alerts on it. Given that API performance was degrading without any other primary metric being affected, the increase in swap usage stood out as the likely root cause.
To reset the swap, we initiated a DB reboot at 2104 hrs. The DB completed the process in 15-20 secs, and we immediately saw response times on the API clusters start to recover. After observing the system for another 30 mins, we were able to sign off on the incident.

To the best of our understanding, the root cause of this incident was the swap usage on our primary DB. This was a novel scenario with no precedent for us, but we will be working actively to deploy measures that prevent this issue from ever recurring.
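One such measure, sketched below purely as an illustration: an alert on database swap usage so the team is paged before latency degrades. This assumes the primary DB runs on Amazon RDS and that alerts go to SNS; the instance identifier, topic ARN, and 256 MiB threshold are assumptions, not values from the RCA.

    import boto3

    cloudwatch = boto3.client("cloudwatch")
    cloudwatch.put_metric_alarm(
        AlarmName="primary-db-swap-usage-high",
        Namespace="AWS/RDS",
        MetricName="SwapUsage",                     # reported in bytes
        Dimensions=[{"Name": "DBInstanceIdentifier", "Value": "primary-db"}],
        Statistic="Average",
        Period=300,                                 # 5-minute datapoints
        EvaluationPeriods=3,                        # sustained for 15 minutes
        Threshold=256 * 1024 * 1024,                # alert above 256 MiB of swap
        ComparisonOperator="GreaterThanThreshold",
        AlarmActions=["arn:aws:sns:ap-south-1:123456789012:ops-alerts"],
    )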

IST
No issues reported.
No issues reported.
No issues reported.
No issues reported.
No issues reported.
No issues reported.
No issues reported.
No issues reported.
Hub API - Alerts
Investigating

The team is observing several alerts for the Hub APIs; the issue is being investigated.

IST
Monitoring

It appears that there was a momentary disruption in the connections between our virtual cloud clusters. It auto-restored and things are back in order.

IST
Resolved

The issue has been resolved, and an RCA will be provided to all the stakeholders.
RCA
At 0949 hrs IST today, we started getting alerts indicating that connectivity to the database system from our API clusters was impacted for Hub-related workflows.
In the meantime, all indicators for our other services were stable.

In the past, we have seen minor, sporadic disruptions in connectivity, but nothing that persisted for this long.

The team responded to the alerts by 0955 hrs IST and worked to establish the root cause and to see whether there were alternate routes to re-establish connectivity.

While we were diagnosing the errors, connectivity recovered at 0959 hrs and all systems were back to stable operational mode, with the backlog of operations cleared inside a minute. Given that all our DB health metrics were good during the incident, a network-level connectivity issue is the most likely suspect, which will require further investigation along with our hosting provider, AWS.
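For transient blips like this one, a common mitigation is to retry the database connection with exponential backoff so short disruptions are absorbed rather than surfacing as API failures. A minimal, generic sketch in Python; the connect callable stands in for whatever database driver is actually in use.

    import time
    from typing import Callable, TypeVar

    T = TypeVar("T")

    def connect_with_backoff(connect: Callable[[], T],
                             attempts: int = 5,
                             base_delay: float = 0.5) -> T:
        """Call `connect`, retrying on failure with exponential backoff."""
        for attempt in range(1, attempts + 1):
            try:
                return connect()
            except Exception:
                if attempt == attempts:
                    raise
                time.sleep(base_delay * 2 ** (attempt - 1))  # 0.5s, 1s, 2s, 4s

    # e.g. conn = connect_with_backoff(lambda: psycopg2.connect(DB_DSN))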

IST
No issues reported.
No issues reported.
No issues reported.
No issues reported.
No issues reported.