MFV Tech Talk #3 - Service Operation | Kitto & Leon

1. Overview

1.1 Definition

According to the IT Infrastructure Library (ITIL), service operation is one of the five stages of the IT service lifecycle, as shown in the following image. Each stage has a purpose and a set of processes, collected from many sources and proven in practice to achieve that purpose.

Service operation makes sure that IT services are delivered effectively and efficiently. ITIL defines the following main processes and functions for service operation:

  • Event management
  • Incident management
  • Request fulfillment
  • Access management
  • Problem management
  • Facilities management
  • Application management
  • Technical management

1.2 How to apply it in reality

The list of processes in ITIL service operation was collected from many cases, both simple and complicated. Applying all of them to a simple case, or to a small product with a small team, would be too heavy. So the answer to the question "how do we apply it?" is "it depends on the case and the context". We don't need to use all the processes. In MF, we apply three of them, which is enough for the current state of the product and the company:

  • Event management
  • Incident management
  • Request fulfillment

Some factors to consider when choosing your processes:

  • Development team: size, knowledge, etc.
  • Product: size, requirements, and business domain

1.3 Engineering viewpoint

ITIL has many definitions that help you understand each stage of the IT service lifecycle at a high level. But IT engineers, especially those who want to apply it to IT services and build it by hand, need to see it from a viewpoint that matches their engineering knowledge. In MF, we use a viewpoint that makes service operation easier to understand and apply: it consists of system monitoring and an operation process.

System monitoring makes sure you collect the necessary events and monitor them continuously, while the operation process guides the actions you take based on the monitoring information (events, and alerts based on events).

With this viewpoint, you can implement service operation step by step.

2. System monitoring

2.1 How monitoring works

Most monitoring tools are built on an agent-server model:

  • Agent: installed wherever we want to monitor, it collects the data it has been configured to collect
  • Server: receives data from agents, using either a pull or a push pattern, and does all the post-processing
  • Some monitoring tools also allow data to be pushed directly through an exposed API. This is useful for short-lived services such as serverless functions
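To make the pull and push patterns concrete, here is a minimal in-memory sketch. The `Agent` and `Server` classes, their method names, and the sample values are all hypothetical; no real monitoring tool is modeled here, and a real agent would send data over the network.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Agent:
    """Runs next to the monitored target and collects the configured data."""
    name: str
    collect: Callable[[], Dict[str, float]]  # hypothetical collector callback

    def sample(self) -> Dict[str, object]:
        # Attach the agent name so the server knows where the data came from.
        return {"agent": self.name, **self.collect()}


@dataclass
class Server:
    """Receives samples; post-processing would happen here."""
    received: List[Dict[str, object]] = field(default_factory=list)

    def ingest(self, sample: Dict[str, object]) -> None:
        # Push pattern: the sender calls the server.
        self.received.append(sample)

    def scrape(self, agents: List[Agent]) -> None:
        # Pull pattern: the server asks each agent for its current data.
        for agent in agents:
            self.received.append(agent.sample())


server = Server()
web = Agent("web-1", lambda: {"cpu_percent": 42.0})

server.scrape([web])                 # pull: server -> agent
server.ingest(web.sample())          # push: agent -> server
# Direct push without an agent, as a short-lived function might do.
server.ingest({"agent": "lambda-job", "duration_ms": 120.0})
```

The third call shows why a direct API matters for short-lived services: the function can report its own data before it exits, without waiting to be scraped.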

At Money Forward, we use Datadog as our main monitoring tool to monitor services running on AWS.

2.2 Three pillars of observability

Monitoring tools work with three main objects: logs, traces, and metrics, also known as the three pillars of observability.

With each object, the monitoring tool provides some functions:

  • Collect
  • Transform
  • Archive
  • Index
  • Visualize

2.3 Log

Logs need to be:

  • Designed
  • Collected and transformed
  • Archived
  • Indexed, explored, and analyzed

Some best practices for working with logs follow.

2.3.1 Structure and format

As is widely recommended, we should use structured logs. A structured log is machine-readable, so it is easy for machines to parse, process, and index. In MF, we use JSON as our log format, with a well-defined structure.
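As a small illustration of why JSON logs are easy to process, the sketch below parses two log lines and filters them by level using only the standard library. The log lines and their field values are made up for the example.

```python
import json

# Two hypothetical JSON log lines, one per line as they would appear in a log stream.
raw_lines = [
    '{"level": "Info", "time": "2022-10-11T09:00:00Z", "message": "journal saved"}',
    '{"level": "Error", "time": "2022-10-11T09:00:01Z", "message": "export failed"}',
]

# Each line parses into a dict, so no custom regex or text parsing is needed.
records = [json.loads(line) for line in raw_lines]

# Filtering on a field is a plain dictionary lookup.
errors = [r["message"] for r in records if r["level"] == "Error"]
print(errors)
```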

2.3.2 Common fields

Every log entry needs some common fields that give us basic information. In MF logs, we have these common fields:

  • Level (Info, Warning, Error)
  • Time
  • Message
  • UserID: the user the log entry belongs to
  • OfficeID (TenantID): the tenant, in multi-tenant applications
  • AppID: we sometimes run a monolith with more than one application deployed in the same service; AppID separates the logs of those applications
  • RequestID & TraceID: attached throughout a processing flow so that we can filter for all the logs of a single request
  • Stacktrace: shows where in the code an error was raised, which helps when debugging error cases
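One possible way to emit these fields is a custom formatter on top of Python's standard `logging` module, sketched below. The key names (`user_id`, `office_id`, and so on), the logger name, and the sample values are assumptions for illustration, not the actual MF schema. Note also that Python writes level names in upper case (`INFO`, `ERROR`).

```python
import json
import logging
from datetime import datetime, timezone

# Hypothetical key names for the context fields listed above.
COMMON_FIELDS = ("user_id", "office_id", "app_id", "request_id", "trace_id")


class JsonFormatter(logging.Formatter):
    """Formats each record as one JSON object per line."""

    def format(self, record: logging.LogRecord) -> str:
        entry = {
            "level": record.levelname,
            "time": datetime.fromtimestamp(record.created, tz=timezone.utc).isoformat(),
            "message": record.getMessage(),
        }
        # Copy only the context fields that were attached to this record.
        for name in COMMON_FIELDS:
            if hasattr(record, name):
                entry[name] = getattr(record, name)
        # Include the stack trace when the record carries exception info.
        if record.exc_info:
            entry["stacktrace"] = self.formatException(record.exc_info)
        return json.dumps(entry)


logger = logging.getLogger("accounting")  # hypothetical logger name
handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger.addHandler(handler)
logger.setLevel(logging.INFO)

# Made-up context values; `extra` attaches them to the log record.
context = {"user_id": "u-1", "office_id": "o-9", "app_id": "AC",
           "request_id": "20221011ACJRNL", "trace_id": "20221011ACJRNL"}
logger.info("journal saved", extra=context)
```

Passing the context through `extra` keeps call sites short. In a real service the context would more likely be set once per request, for example by middleware, rather than at every log call.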

2.3.3 RequestID and TraceID

RequestID and TraceID, as mentioned above, help us trace all the logs belonging to a request or a processing flow. In our experience, they should carry not only an identifier for the request or flow but also other important information. For example, in the Accounting Cloud service, RequestID and TraceID have this structure:

  • YYYYMMDD: 8 chars for time information. Ex: 20221011
  • APPID: 2 chars for app ID information
  • FeatureID: 4 chars for feature ID information
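The structure above can be sketched as a pair of helper functions. This covers only the three parts listed; any further component a real ID would need to be unique per request is not described here and is left out. The function names and the sample values `AC` and `JRNL` are hypothetical.

```python
from datetime import date


def build_request_id(day: date, app_id: str, feature_id: str) -> str:
    """Concatenate YYYYMMDD (8 chars), app ID (2 chars), and feature ID (4 chars)."""
    if len(app_id) != 2:
        raise ValueError("app_id must be exactly 2 characters")
    if len(feature_id) != 4:
        raise ValueError("feature_id must be exactly 4 characters")
    return f"{day:%Y%m%d}{app_id}{feature_id}"


def parse_request_id(request_id: str) -> dict:
    """Split the fixed-width parts back out, e.g. to filter logs by feature."""
    return {
        "date": request_id[0:8],
        "app_id": request_id[8:10],
        "feature_id": request_id[10:14],
    }


rid = build_request_id(date(2022, 10, 11), "AC", "JRNL")
parts = parse_request_id(rid)
```

Because every part has a fixed width, the date, application, and feature can be read straight from the ID without looking anything up.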

2.3.4 Collect log

See the tech talk video at the end for more about collecting logs on AWS ECS and serverless services.

2.4 Trace

A trace helps us understand the full path a request takes through the application. Traces are essential for tracking the performance of an application across one or many services in the system.

Working with traces involves these steps:

  • Instrument the code with tracing
  • Collect traces
  • Explore and analyze traces
  • Archive

Some best practices for traces:

  • Keep a reasonable number of spans in each trace
  • Map the relationships between services
  • Give spans tags and detailed information that serve the tracing purpose
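To show what a span with tags looks like, here is a minimal pure-Python sketch. It is not the API of Datadog or any other tracing library; the `Tracer` and `Span` classes, the span names, and the tag values are all hypothetical.

```python
import time
from contextlib import contextmanager
from dataclasses import dataclass, field
from typing import Dict, Iterator, List, Optional


@dataclass
class Span:
    name: str
    trace_id: str
    parent: Optional[str]                      # name of the enclosing span, if any
    tags: Dict[str, str] = field(default_factory=dict)
    duration_ms: float = 0.0


class Tracer:
    """Collects the spans of a single trace in memory."""

    def __init__(self, trace_id: str) -> None:
        self.trace_id = trace_id
        self.finished: List[Span] = []
        self._stack: List[Span] = []

    @contextmanager
    def span(self, name: str, **tags: str) -> Iterator[Span]:
        parent = self._stack[-1].name if self._stack else None
        current = Span(name, self.trace_id, parent, dict(tags))
        self._stack.append(current)
        start = time.perf_counter()
        try:
            yield current
        finally:
            # Record the duration even if the wrapped code raised.
            current.duration_ms = (time.perf_counter() - start) * 1000
            self._stack.pop()
            self.finished.append(current)


tracer = Tracer(trace_id="20221011ACJRNL")
with tracer.span("http.request", service="api", route="/journals"):
    with tracer.span("db.query", service="db", table="journals"):
        time.sleep(0.01)  # stand-in for real work
```

Spans finish from the inside out, so `tracer.finished` lists `db.query` before `http.request`. The `service` tag on each span is what lets a tool map the calls between services.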

After applying tracing, we can build an overview of the system with a monitoring tool.

To see more about how we collect and manage traces, check the tech talk video.

2.5 Metric

A metric is a measurement of a service captured at runtime. It helps us understand the status of services while they are running.

Some best practices for metrics:

  • Collect at least the essential metrics: CPU, memory, error rate, latency, network, etc.
  • Define and collect internal application metrics for your own purposes
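As an illustration of two of the essential metrics, the sketch below derives an error rate and latency percentiles from recorded requests. The `RequestMetrics` class and the sample numbers are hypothetical; in practice the monitoring tool usually does this aggregation for you.

```python
import math
from dataclasses import dataclass, field
from typing import List


@dataclass
class RequestMetrics:
    """Accumulates per-request measurements in memory."""
    latencies_ms: List[float] = field(default_factory=list)
    errors: int = 0

    def record(self, latency_ms: float, ok: bool) -> None:
        self.latencies_ms.append(latency_ms)
        if not ok:
            self.errors += 1

    def error_rate(self) -> float:
        total = len(self.latencies_ms)
        return self.errors / total if total else 0.0

    def percentile(self, p: float) -> float:
        # Nearest-rank percentile: the smallest value with at least p% of samples at or below it.
        if not self.latencies_ms:
            return 0.0
        ordered = sorted(self.latencies_ms)
        rank = max(1, math.ceil(p / 100 * len(ordered)))
        return ordered[rank - 1]


metrics = RequestMetrics()
# Made-up samples: four fast successful requests and one slow failure.
for latency, ok in [(10, True), (20, True), (30, True), (40, True), (500, False)]:
    metrics.record(latency, ok)

print(metrics.error_rate(), metrics.percentile(50), metrics.percentile(95))
```

With these samples the median latency is 30 ms while the 95th percentile is 500 ms. This is why percentiles are usually watched alongside, or instead of, averages.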

For more information, please check out the video of my talk at the end of this article.

3. Operation Process | SLA/SLO/SLI | Monitoring Dashboard

All the details are covered in the video. Please enjoy.



Apr 12, 2024