Recent Activity
How could a relationship become so toxic...
Blameless postmortems are a tenet of SRE culture. For a postmortem to be truly blameless, it must focus on identifying the contributing causes of the incident without indicting any individual or team for bad or inappropriate behavior. A blamelessly written postmortem assumes that everyone involved in an incident had good intentions and did the right thing...
Take clear notes of what ideas you had, which tests you ran, and the results you saw. Particularly when you are dealing with more complicated and drawn-out cases, this documentation may be crucial in helping you remember exactly what happened and prevent having to repeat these steps. If you performed active testing by changing a system—for instance by giving...
Ways in which things go right are special cases of the ways in which things go wrong. —— John Allspaw
—— Google SRE Book, Chapter 12
Formally, we can think of the troubleshooting process as an application of the hypothetico-deductive method: given a set of observations about a system and a theoretical basis for understanding system behavior, we iteratively hypothesize potential causes for the failure and try to test those hypotheses.
—— Google SRE Book, Chapter 12
Ensuring that the cost of maintenance scales sublinearly with the size of the service is key to making monitoring (and all sustaining operations work) maintainable. This theme recurs in all SRE work, as SREs work to scale all aspects of their work to the global scale.
—— Google SRE Book, Chapter 10.
Many important services in Google, e.g., Search, Ads, and Gmail, have dedicated teams of SREs responsible for the performance and reliability of these services. Thus, SREs are on-call for the services they support. The SRE teams are quite different from purely operational teams in that they place heavy emphasis on the use of engineering to approach problems....
As a system grows more complex, the separation of responsibility between APIs and between binaries becomes increasingly important. This is a direct analogy to object-oriented class design: just as it is understood that it is poor practice to write a "grab bag" class that contains unrelated functions, it is also poor practice to create and put into production...
You might read the examples in this chapter and decide that you need to be Google-scale before you have anything to do with automation whatsoever. This is untrue, for two reasons: automation provides more than just time-saving, so it’s worth implementing in more cases than a simple time-expended versus time-saved calculation might suggest. But the approach...
Automation code, like unit test code, dies when the maintaining team isn’t obsessive about keeping the code in sync with the codebase it covers. The world changes around the code: the DNS team adds new configuration options, the storage team changes their package names, and the networking team needs to support new devices.
—— Google SRE Book, Chapter 7
Pages with rote, algorithmic responses should be a red flag. Unwillingness on the part of your team to automate such pages implies that the team lacks confidence that they can clean up their technical debt. This is a major problem worth escalating.
—— Google SRE Book, Chapter 6
Note that in a multilayered system, one person’s symptom is another person’s cause. For example, suppose that a database’s performance is slow. Slow database reads are a symptom for the database SRE who detects them. However, for the frontend SRE observing a slow website, the same slow database reads are a cause. Therefore, white-box monitoring is sometimes...
The four golden signals of monitoring are latency, traffic, errors, and saturation. If you can only measure four metrics of your user-facing system, focus on these four.
—— Google SRE Book, Chapter 6
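To make the four signals a bit more concrete, here is a minimal Python sketch that summarizes them for one time window. Everything in it is an assumption for illustration, not something from the book: the `Request` record, the `golden_signals` helper, the choice of a nearest-rank 99th percentile for latency, and the definition of saturation as in-flight requests divided by a fixed capacity.

```python
import math
from dataclasses import dataclass


@dataclass
class Request:
    """Hypothetical record of one served request."""
    latency_ms: float
    ok: bool


def golden_signals(requests, window_s, in_flight, capacity):
    """Summarize latency, traffic, errors, and saturation for one window.

    Assumptions: latency is reported as a nearest-rank p99, and saturation
    is approximated as in-flight requests over a fixed capacity.
    """
    if not requests:
        raise ValueError("need at least one request in the window")
    latencies = sorted(r.latency_ms for r in requests)
    # Nearest-rank percentile: the smallest value covering 99% of samples.
    rank = max(1, math.ceil(0.99 * len(latencies)))
    failed = sum(1 for r in requests if not r.ok)
    return {
        "latency_p99_ms": latencies[rank - 1],
        "traffic_rps": len(requests) / window_s,
        "error_rate": failed / len(requests),
        "saturation": in_flight / capacity,
    }


sample = [Request(10, True), Request(20, True), Request(30, False), Request(40, True)]
signals = golden_signals(sample, window_s=2, in_flight=5, capacity=10)
```

A real system would usually pull these numbers from its monitoring stack rather than compute them by hand, and might track the latency of failed requests separately from that of successful ones.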
Finally got strapi updated to the newest stable version...
If we all commit to eliminate a bit of toil each week with some good engineering, we’ll steadily clean up our services, and we can shift our collective efforts to engineering for scale, architecting the next generation of services, and building cross-SRE toolchains. Let’s invent more, and toil less.
——Google SRE Book, Chapter 5.
Landing an SE/SRE summer internship in the US
Greetings from Bittersweet. It has been a long time since I last posted anything on the blog. It's not that I don't want to write anything, but rather that I'm not good at putting my thoughts into words... I really should practice that more often. After months of searching, coding, and intervi…
El Psy Kongroo
Thank you, Dropbox, for taking me in as a Dropboxer :p
A late Happy Chinese New Year to everyone!
Wishing you a happy 2021 filled with good luck :p
(just wanna post something to prove I'm alive out there lol)
(I might start to write an algorithm series very soon, well, hopefully...)
One of the ways to practice mindfulness of emotions is summarized in the acronym RAIN: recognize, accept and allow, investigate and non-personal or nurture. We recognize what emotion is present, allow it to be, investigate how it feels in the body, and don't take it personally! Our emotions are not us; they are like weather passing through. They do not need to...