Keep your crash-free sessions up to 99%

Or how to recognize a good application

Michael Spitsin
Aug 22, 2017

Today we will discuss a fashionable and well-known theme: crash-free sessions in an Android application. So the first question: why 99%? It looks pretty big, and it may seem that 99% is something beyond imagination. At least some companies and startups seem to think so, because from time to time I hear things like this in interviews: “We have a very stable application, 98% crash-free sessions, and next month we will reach 99%”. I’m not trying to laugh at these guys, but I once heard such a phrase said with real pride (and not by a businessman). A high crash-free rate does not guarantee a good application, but every good application, from my point of view, must aim toward 100% crash-free sessions. So let’s dig in and figure out the what, why and when.

99% magic number

So the number 99% itself is pretty big. For example, if I got a 99% discount on GTA V on Steam, I would think I was in heaven. But we judge numbers relative to their purpose, and in terms of stability the picture is significantly different.

Let’s start with 90% crash-free sessions. This number means that if ten people run our application and use it until they close it or it crashes, one of them will get a crash. One in ten people. For example, a medium-sized plane can accommodate approximately 180 people. If all of them use our application, 18 of them will get a crash. I think that is a very significant number. Of course, this does not mean that the application is crap, but it is not one of the best apps either. 90% is the crash-free rate of a mediocre application: not bad, but not so good. Just medium. (I agree that there can be exceptions even to this gradation.)

Now let’s think about 99%. That is one person in one hundred. In our previous example, out of 180 plane passengers only 1 or 2 will get a crash, and that is a good ratio. Of course, I have no mathematical proof for choosing the right number of acceptable crashes. But imagine that your application has 10M installs, so 10M people have run it at least once. Now compare: either 1M of them get a crash, or 100 thousand.
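
The arithmetic behind these numbers can be written down as a small sketch. It assumes, as the examples do, that every person runs exactly one session; the class and method names are made up for illustration.

```java
/** Back-of-the-envelope crash arithmetic; all names here are illustrative. */
final class CrashMath {
    private CrashMath() {}

    /**
     * Expected number of crashed sessions for a given crash-free rate
     * (0.99 means 99%), assuming one session per person.
     */
    static long expectedCrashes(long sessions, double crashFreeRate) {
        if (crashFreeRate < 0.0 || crashFreeRate > 1.0) {
            throw new IllegalArgumentException("crashFreeRate must be in [0, 1]");
        }
        return Math.round(sessions * (1.0 - crashFreeRate));
    }

    public static void main(String[] args) {
        System.out.println(expectedCrashes(180, 0.90));        // 18 passengers
        System.out.println(expectedCrashes(180, 0.99));        // 2 (1.8 rounded)
        System.out.println(expectedCrashes(10_000_000, 0.90)); // 1000000
        System.out.println(expectedCrashes(10_000_000, 0.99)); // 100000
    }
}
```

In a real application people run many sessions, so the share of affected users can differ from the share of crashed sessions in either direction.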

Moreover, I am sure that every programmer who is truly inspired by his/her work must care about crash-free sessions and aim toward 100% crash-free. And every good programmer must be inspired by the project, or at least do his/her work professionally. So, by transitivity, we get the next statement: every good programmer must care about the crash-free sessions of his/her application. Of course, this is a very blunt statement, and there are a lot of projects with different team structures. For example, there could be a team with a dedicated programmer who only fixes bugs and nothing more. But that sounds very strange to me, since it means that every programmer can make a mess in the application while a special cleaner tidies it up. :) Finally, 90% and 99% are just boundary points for roughly sorting applications into bad, medium and good.

I have 99%, so my application is perfect.

Hold on, my friend. Now let’s talk about the next question: does 99% necessarily mean that the app is good? For example, our team’s application at some point had 99.7% crash-free sessions, but I still think it was not, and is not, perfect. Being crash-free is only one of the requirements of a good, stable application; it is not the only measure of quality. There are other parameters, such as the number of ANRs, the performance of the application, or the consistency of data and UI.

We can have an application that is almost perfect in terms of crash-free sessions but, for example, very slow. There are many possible reasons for that, on both the client and the server side, and we need to figure out each one. On the client, test and analyze memory usage, lazy evaluation, computation on the UI thread, etc. On the server side, we as clients need to analyze the sizes of query results and evaluate how many queries we make and how heavy they are.

Also, our application can have an ANR (Application Not Responding) error, and it can be far more complex than a single crash stack trace. Of course, there are always exceptions to the rule. But generally, when you have an exception, you have a stack trace; you can go through it and find where the problem appears. You already have an exact starting point. When you have an ANR, you know nothing. Okay, it is not that bad: you know roughly when it happens. But you don’t know the exact place, so you need to spend additional time figuring out how it could happen. I agree that with crashes you need to go through the same mental process and find out how the failure is possible, but in general you can do it faster with crashes.

I can easily fix my crashes

There are tons of crashes that could be fixed right here and right now. Adding an extra check and catching an exception are the two basic ways to eliminate a crash. But if you rush and do it mechanically, you risk getting stuck with unexpected application behavior: UI glitches, inconsistent data or UI, unpredictable toasts with the message “Oops! Something went wrong”. So we can often get rid of a crash very quickly. At least we think so. But the truth is that we need to check every crash and understand why it happened. That is why fixing app exceptions becomes slow.

It is not just “Oh! NPE! I just need to add a null check”. True, sometimes it was a mistake by a developer who forgot to add a null check required by the specification, documentation, etc. But a lot of the time NPEs happen for other reasons. For instance, you need to request the price of a house from the server and then show it on the screen. But at the moment of display you have a null price, and an NPE happens. You can just add a null check and show a placeholder like “unavailable” when there is no price. But the question is whether this is expected behavior from the server. Maybe the price of the house is a required parameter and thus must not be null. Then this is a real server error, and it is not a given that it should be fixed in the client. But let’s dig deeper. Can we trust the server? That part of it could be legacy code, and it is not a given that the bug will be fixed any time soon. So it turns into a kind of limitation that we have to live with, and in that case we need to do something on the client.

There can also be a situation where both client and server work correctly at this stage, but we received a null price because we sent bad parameters or were already in an inconsistent state. Then we need to step back and see what we did wrong earlier.
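
If you do decide to live with the null price on the client, one possible shape for the fix is sketched below: show the placeholder, but also record the violation so that the real cause can still be investigated. The class, the method and the in-memory `reported` list are hypothetical; the list only stands in for whatever non-fatal reporting your crash service offers.

```java
import java.math.BigDecimal;
import java.util.ArrayList;
import java.util.List;

/** Illustrative only: formats a house price without hiding a broken contract. */
final class PriceLabel {
    /** Stand-in for a real "non-fatal error" report (hypothetical). */
    static final List<String> reported = new ArrayList<>();

    private PriceLabel() {}

    static String format(String houseId, BigDecimal price) {
        if (price == null) {
            // The null check prevents the NPE, but we keep a trace of it:
            // the price may be a required field, or our request may be wrong.
            reported.add("Missing price for house " + houseId);
            return "unavailable";
        }
        return price.toPlainString();
    }

    public static void main(String[] args) {
        System.out.println(format("house-1", new BigDecimal("250000")));
        System.out.println(format("house-2", null));
        System.out.println(reported);
    }
}
```

The point of the design is that the placeholder alone would turn a crash into a silent bug, while the report keeps the question “why was it null?” visible.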

What can I do to keep my app in good health?

It’s a good question, I suppose, because we all want our applications to be perfect and our work projects to be good. It influences our future career, our relationships with colleagues and the business, and our own mood. We want to keep our applications stable and fast, because we are true professionals, right?

So, there are common things we can do as prophylaxis for our applications:

Always measure

Crashes. I’m sure you have some crash statistics system. It can be the Google Play Console crash reports, Crashlytics, HockeyApp, or maybe a good old self-written system that sends crashes to your work email. If you have something like that, use it. Go to your service and see which exceptions your application has, how often they appear and on which types of phones. Measure. Measure their count and the cost of fixing them.
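
One rough way to weigh count against cost is to rank crash groups by how many occurrences each hour of fixing would remove. This is only a sketch: the `CrashGroup` class, the hour estimates and the score itself are assumptions for illustration, not part of any crash-reporting service.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Illustrative triage helper; the scoring rule is an assumption. */
final class CrashTriage {
    static final class CrashGroup {
        final String title;
        final long occurrences;         // how often the crash was seen
        final double estimatedFixHours; // your own estimate of the fixing cost

        CrashGroup(String title, long occurrences, double estimatedFixHours) {
            if (estimatedFixHours <= 0) {
                throw new IllegalArgumentException("estimatedFixHours must be positive");
            }
            this.title = title;
            this.occurrences = occurrences;
            this.estimatedFixHours = estimatedFixHours;
        }

        double occurrencesPerFixHour() {
            return occurrences / estimatedFixHours;
        }
    }

    private CrashTriage() {}

    /** Returns a new list with the best "occurrences per fix hour" first. */
    static List<CrashGroup> prioritize(List<CrashGroup> groups) {
        Comparator<CrashGroup> byScore =
                Comparator.comparingDouble(CrashGroup::occurrencesPerFixHour);
        List<CrashGroup> sorted = new ArrayList<>(groups);
        sorted.sort(byScore.reversed());
        return sorted;
    }

    public static void main(String[] args) {
        List<CrashGroup> groups = new ArrayList<>();
        groups.add(new CrashGroup("NPE in price screen", 500, 10.0));
        groups.add(new CrashGroup("Bad cast in adapter", 120, 1.0));
        groups.add(new CrashGroup("Index out of bounds in list", 30, 0.5));
        for (CrashGroup g : prioritize(groups)) {
            System.out.println(g.title);
        }
    }
}
```

With the sample numbers the cheap, frequent "Bad cast in adapter" comes first and the expensive NPE comes last, even though the NPE has the highest raw count. A real priority would also consider which phones and which users are affected.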

Bugs. There are cases when bugs are not crashes but some inconsistent behavior or UI state, or a simple toast “Oops! Something went wrong”. To find them, you need to communicate constantly with QA engineers. Or, if you have no testers, you need to use your application yourself all the time or gather feedback from relatives, friends, Google Play users, etc. When you have received all the feedback, you need to prioritize it and decide which bugs to handle now and which to resolve later. Even though you work in a team and probably have a team lead or a person responsible for prioritizing the backlog (for example, a product owner), you need to be engaged in the process. For example, you may have additional information about some bugs that will help to prioritize them correctly.

Performance. There are two well-known standard ways to analyze application performance: Device Monitor and Systrace. They are very similar, but still differ from each other. Personally, for time measurement and comparison I prefer Device Monitor. So the common suggestion here is: if you can spot some lags, spend 5–10 minutes, open Device Monitor, record the lagging part of the application and see what blocks the UI thread and eats its resources. Timing is not the only performance measure, though; the next point is also about performance.
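
Before opening a profiler, a crude first check is to time the suspicious piece of code and compare it with the time available for one frame. The sketch below assumes a 60 fps target (about 16.6 ms per frame); the class and method names are made up, and it is a complement to the tools above, not a replacement.

```java
/** Illustrative timing helper; assumes a 60 fps rendering target. */
final class FrameBudget {
    /** Time available per frame at 60 fps, in nanoseconds (1 s / 60). */
    static final long FRAME_BUDGET_NANOS = 1_000_000_000L / 60;

    private FrameBudget() {}

    /** Runs the work on the current thread and returns the elapsed nanoseconds. */
    static long measureNanos(Runnable work) {
        long start = System.nanoTime();
        work.run();
        return System.nanoTime() - start;
    }

    /** True if work of this duration alone would make the UI thread miss a frame. */
    static boolean exceedsFrameBudget(long elapsedNanos) {
        return elapsedNanos > FRAME_BUDGET_NANOS;
    }

    public static void main(String[] args) {
        long elapsed = measureNanos(() -> {
            // Stand-in for the code you suspect of blocking the UI thread.
            StringBuilder sb = new StringBuilder();
            for (int i = 0; i < 100_000; i++) {
                sb.append(i);
            }
        });
        System.out.println("Took " + elapsed / 1_000_000.0 + " ms, over budget: "
                + exceedsFrameBudget(elapsed));
    }
}
```

A single measurement is noisy, so treat the result as a hint about where to point the profiler, not as a verdict.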

UI complexity. Of course, we cannot build the simplest possible UI with only one or two EditTexts and one TextView for outputting a result. We have a complex system, I’m sure. But a good UI must be simple, and not only for the user: I mean that good layouts and views must be simple too. The more complex the UI, the more time it takes to inflate and create. Even if the UI is perfect for the business, always check whether you have unnecessary nested layouts and the like. Maybe your application is so sensitive to inflation time that it would be better to create the UI programmatically rather than in XML. Also, if you have a very complex layout and you want to keep it in XML, consider using ViewStubs or some form of asynchronous inflation.

Memory. Android Studio gives us two interesting tools for measuring memory and finding out whether there is a memory leak. The first one is Memory Monitor. It shows the memory consumed by the application and the remaining available memory on a graph. You can observe it and see, for example, how fast memory grows, and thus identify spots in the application that consume a critical amount of memory. The second tool is Memory Dump. In it, you can see the details of the consumed memory and thus recognize which objects were leaked.

I’m sure that this is not a full list of measurements. If you know any other important measures, please write them in the comments and I will update the article.

Periodic prophylaxis

A one-off measurement is not enough. Suppose you sit down with the project one day and decide: “Okay, let’s check all that stuff on it”. You spend the whole day measuring memory consumption, performance and other quality metrics, and then forget about it for the next year. No, it will not work that way.

We, as good programmers, must do it periodically. Moreover, for me it is better to periodically inspect small portions of the app’s potential problems than to spend a whole day on them once, even if that day brings significant progress. In the first case you can quickly check the problems and roughly understand whether you need to fix them ASAP or can delay them for some time. As for me, I try to check crash-free sessions with every new release (approximately every two weeks), the causes of bad performance and memory consumption once every couple of months, and bugs as soon as QA gives them to me. Then I roughly determine whether each one is critical in terms of the current project situation, talk with my team lead, and we choose whether to fix it now or later.

Be a true investigator

As for me, it is really important to keep your skills growing. That way you can discover new language features, aspects of the SDK, and approaches to solving architectural and local problems. Of course, you don’t need to rush to implement or use a new super feature you read about yesterday, but you can plan for it if it will really increase code stability, testability, readability and other -abilities.

You need to always check whether your team’s code style is still appropriate or needs to be adjusted a little. If you (the team) do not regularly revisit your code style (or check whether it needs to change), there is a chance that one day you will find out that your style is stale. Your application grows, your architecture evolves, your approaches change, so there is a chance that your code style will also require small changes. But it is not a rule, just a possibility.

You need to always investigate new libraries and approaches to different problems. Keep reading books. Keep reading articles. Keep monitoring, for example, Android Arsenal. And always leave a small window for refactoring in your development. Otherwise, you will become a slave to the maintenance of redundant and stale functionality.

Afterword

I did not expect to write such a big article. I expected to write a small inspirational speech for the sake of increasing crash-free sessions. But as it turned out, the theme is deeper, and I decided to write a “couple” more words.

If you have read this far (or just scrolled here), let’s recap all of the above:

  • Keep crash-free sessions at 99% or above and try to increase them to 100%
  • A high crash-free rate is required for a good application, but it does not guarantee one
  • Always check and measure your application to increase its quality
  • Always improve your skills and knowledge base, to invent or find new, more appropriate ways to solve problems

I hope you’ve read the article and, if you did, I hope you liked it. If so, press the like button, and if you have something to add, or you don’t agree with some of the thoughts, please leave a comment below. I will be glad to answer you and discuss this topic. Thank you.
