This section covers the operation of a FactCast server and its related infrastructure, along with some lessons learned from operations.
Ops
1 - Metrics
Being a regular Spring Boot 2+ application, the FactCast Server uses micrometer.io as its metrics emitting/collecting solution. In order to get started collecting the metrics FactCast Server emits, you’ll need to choose a backend/store for your metrics. Micrometer has lots of prebuilt bindings to choose from. Please refer to the respective documentation in the Setup section of the micrometer docs.
When it comes to metrics, you’ll have to know what you’re looking for. There are
- Server metrics in the FactCast Server,
- Client metrics in the FactCast client, and
- additional client metrics in the Factus client library.
We’re focussing on Server metrics here.
Metric namespaces and their organization
At the time of writing, there are seven namespaces exposed:
- factcast.server.timer
- factcast.server.meter
- factcast.store.timer
- factcast.ui.timer
- factcast.store.meter
- factcast.registry.timer
- factcast.registry.meter
Depending on your micrometer binding, you may see a slightly different spelling in your data (like ‘factcast_store_timer’, if your datasource has a special meaning for the ‘.’ character).
Furthermore, metrics in operations are automatically tagged with
- an operation name
- a store name (‘pgsql’ currently) and
- an exception tag (‘None’ if unset).
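To make the naming and tagging concrete, the following sketch renders one metric in the underscore spelling together with its tags. It is plain Java without Micrometer: the class `MetricNameSketch`, its methods, and the tag keys `operation`, `store` and `exception` are assumptions made for illustration, and real bindings may apply different rules or add their own suffixes.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.stream.Collectors;

/** Illustration only: how a dotted namespace may look in a backend that reserves '.'. */
final class MetricNameSketch {

  /** Replaces every '.' with '_', e.g. factcast.store.timer becomes factcast_store_timer. */
  static String underscoreSpelling(String dottedName) {
    return dottedName.replace('.', '_');
  }

  /** Builds the three tags described above; the tag keys are assumed names. */
  static Map<String, String> operationTags(String operation, String store, String exception) {
    Map<String, String> tags = new LinkedHashMap<>();
    tags.put("operation", operation);
    tags.put("store", store);
    // An unset exception is reported as 'None'.
    tags.put("exception", exception == null ? "None" : exception);
    return tags;
  }

  /** Renders name{key="value",...}; tags keep the insertion order of the map. */
  static String render(String name, Map<String, String> tags) {
    String renderedTags = tags.entrySet().stream()
        .map(e -> e.getKey() + "=\"" + e.getValue() + "\"")
        .collect(Collectors.joining(","));
    return name + "{" + renderedTags + "}";
  }
}
```

Rendering `factcast.store.timer` for a `publish` operation on the `pgsql` store without an exception gives `factcast_store_timer{operation="publish",store="pgsql",exception="None"}`.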
Existing metrics
There are a bunch of metrics already emitted in the server. These metrics can be grouped by type:
- Timers (collecting durations of code execution)
- Meters (collecting metric events, for example, occurrences of errors)
As this list is continuously growing, we cannot guarantee the documentation’s completeness. If you want to see the current list of operations, please look at StoreMetrics.java, RegistryMetrics.java, ServerMetrics.java, or UIMetrics.java respectively.
At the time of writing (>0.9.9), the metrics exposed by the namespace group factcast.server are:
| operation | type | description | 
|---|---|---|
| handshake | timer | Duration of the initial handshake. | 
| factsSent | meter | Number of facts sent to subscribed clients. | 
| bytesSent | meter | Amount of facts (in bytes) sent to subscribed clients. | 
At the time of writing (0.7.6), the metrics exposed by the namespace group factcast.store are:
| operation | type | description | 
|---|---|---|
| publish | timer | Time to publish (write) a fact or a list of facts sent by the client. Ref: concepts | 
| subscribe-follow | timer | Time to create and return a follow subscription (not the actual stream of facts). Ref: concepts | 
| subscribe-catchup | timer | Time to create and return a catchup subscription (not the actual stream of facts). Ref: concepts | 
| fetchById | timer | Time to get a fact from a given ID. | 
| serialOf | timer | Time to get the serial of a fact. | 
| enumerateNamespaces | timer | Time to process namespaces enumeration. | 
| enumerateTypes | timer | Time to process types enumeration. | 
| getStateFor | timer | Time to get the latest state token for a given fact specification. The state represents the serial of the last fact matching the specifications, and is used by the client to determine whether a fact stream has been updated at a given point in time. Relevant for optimistic locking. Ref: optimistic locking | 
| publishIfUnchanged | timer | Time to check against the given state token and possibly publish (write) a fact or a list of facts sent by the client. Ref: optimistic locking | 
| getSnapshot | timer | Time to read a snapshot from the cache. Ref: snapshots | 
| setSnapshot | timer | Time to create/update a snapshot in the cache. Ref: snapshots | 
| clearSnapshot | timer | Time to delete a snapshot from the cache. Ref: snapshots | 
| compactSnapshotCache | timer | Time to delete old entries from the snapshot cache. Ref: snapshots | 
| invalidateStateToken | timer | Time to invalidate the state token used for optimistic locking. The client can abort the transaction and let the server invalidate the token used for consistency. Ref: optimistic locking | 
| notifyRoundTripLatency | timer | Time it takes for a notify on the database to be echoed back to the listener (roundtrip). | 
| catchupFact | meter | Counts the number of facts returned by a catchup subscription or catchup part of a follow subscription request (e.g. Factus managed projections) managed by the EventStore. Ref: concepts | 
| catchupTransformationRatio | meter | [deprecated] Percentage of facts transformed (downcasted/upcasted) by the server in response to a subscribed client. Useful for debugging the amount of overhead due to transforming, for subscriptions returning a significant amount of facts. Ref: transformation | 
| missedRoundtrip | meter | If inactive for more than a configured interval (factcast.store.factNotificationBlockingWaitTimeInMillis), the server validates the health of the database connection. For this purpose it sends an internal notification to the database and waits to receive an answer within the interval defined by factcast.store.factNotificationMaxRoundTripLatencyInMillis. This metric counts the number of notifications sent without an answer from the database. | 
| snapshotsCompacted | meter | Counts the number of old snapshots deleted. This runs as a dedicated scheduled job, configured by factcast.store.snapshotCacheCompactCron. Ref: snapshots | 
| tailIndices | meter | Counts the number of tail indices being present after tail index maintenance. They have a “state” tag which can be used to distinguish between valid/invalid ones and they carry a “maintenance” tag which can be either skipped or executed and reflects whether maintenance was actually executed due to ongoing index operations. | 
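The roundtrip check behind notifyRoundTripLatency and missedRoundtrip can be pictured as follows. This is a minimal sketch in plain Java, not the FactCast implementation: the class `RoundTripProbe`, its methods, and the in-memory queue standing in for the database notification channel are all hypothetical. The `maxLatencyMillis` parameter plays the role of factcast.store.factNotificationMaxRoundTripLatencyInMillis.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical sketch of a notify roundtrip check; not FactCast's actual code. */
final class RoundTripProbe {
  // Stands in for the channel on which the database echoes notifications back.
  private final BlockingQueue<String> echoes = new LinkedBlockingQueue<>();
  // Plays the role of the missedRoundtrip meter.
  private final AtomicLong missedRoundtrips = new AtomicLong();

  /** To be called by the listener side when a notification is echoed back. */
  void onEcho(String payload) {
    echoes.offer(payload);
  }

  /**
   * Sends one notification and waits at most maxLatencyMillis for its echo.
   * Returns the measured latency in milliseconds, or -1 if no echo arrived in time.
   */
  long probe(Runnable sendNotification, long maxLatencyMillis) {
    echoes.clear(); // drop stale echoes from earlier probes
    long start = System.nanoTime();
    sendNotification.run();
    try {
      String echo = echoes.poll(maxLatencyMillis, TimeUnit.MILLISECONDS);
      if (echo != null) {
        return TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start);
      }
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt(); // preserve the interrupt for the caller
    }
    // No echo within the allowed latency: count a missed roundtrip.
    missedRoundtrips.incrementAndGet();
    return -1;
  }

  long missedRoundtrips() {
    return missedRoundtrips.get();
  }
}
```

A rising missed-roundtrip count in such a setup points at the notification path between server and database rather than at query performance, which is why it is worth alerting on separately from the timers.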
At the time of writing (0.4.3), the metrics exposed by the namespace group factcast.registry are:
| operation | type | description | 
|---|---|---|
| transformEvent | timer | Time to transform (upcast/downcast) a single fact. Ref: transformation | 
| fetchRegistryFile | timer | Time to retrieve a file from the schema registry. Ref: facts validation | 
| refreshRegistry | timer | Time to execute the schema registry refresh, in order to get the latest schema and transformation updates. | 
| compactTransformationCache | timer | Time to delete old entries from the transformation cache. | 
| transformationCache-hit | meter | Counts the number of hits from the transformation cache. | 
| transformationCache-miss | meter | Counts the number of misses from the transformation cache. | 
| missingTransformationInformation | meter | Counts the number of times that the server was not able to find transformation information from the schema registry. | 
| transformationConflict | meter | Counts the number of conflicts encountered by the server during schema registry update, which is caused by trying to change an existing transformation. | 
| registryFileFetchFailed | meter | Counts the number of times that the server was not able to get a json file from the schema registry. | 
| schemaRegistryUnavailable | meter | Counts the number of times that the server was unable to reach the schema registry. | 
| transformationFailed | meter | Counts the number of times that the server failed to transform a fact, using downcasting/upcasting scripts. | 
| schemaConflict | meter | Counts the number of conflicts detected by the server on the facts schema returned by the schema registry. | 
| factValidationFailed | meter | Counts the number of times that the server failed to validate a fact that a client attempted to publish against the schema registry. | 
| schemaMissing | meter | Counts the number of times that the server detected a schema missing from the schema registry. | 
| schemaUpdateFailure | meter | Counts the number of times that the server was unable to update its schema definition from the schema registry, while fetching the initial state of the registry or during refresh. | 
gRPC Metrics
If you’re looking for remote calls and their execution times (including marshalling/de-marshalling from protobuf), you can have a look at the metrics automatically added by the gRPC library that we use. The relevant namespaces are:
- grpcServerRequestsReceived and
- grpcServerResponsesSent
These automatically added metrics only cover the service methods defined in the protocol buffer specs. Since a gRPC remote call does not trigger everything we want to measure, we introduced additional metrics. When comparing, for instance, the automatically added gRPC durations with ‘factcast.store.duration’, you will find a subtle difference: instead of including the gRPC overhead, we chose to measure only the actual invocations on the FactStore/TokenStore implementation. Depending on your needs, you may want to focus on one or the other.
Executor Metrics
Micrometer provides an integration to monitor the default thread pool executor created by Spring Boot.
Under the same namespace executor.*, we publish metrics for our own thread pool executors used inside FactCast.
You can distinguish them by the name tag. Currently, these are:
- subscription-factory - used for incoming new subscriptions
- parallel-transformation - used for batch transformation of buffered facts
- paged-catchup - used for buffered transformation while using the paged catchup strategy
- transformation-cache - used for inserting/updating entries in the transformation cache (only if you use persisted cache)
- pg-listener - used by the Guava EventBus that receives signals from PostgreSQL
- telemetry - used by the Guava EventBus that receives signals from the FactCast Server (see telemetry)
See https://micrometer.io/docs/ref/jvm for more information.
UI Metrics
Special metrics for the FactCast-Server UI are published via the factcast.ui.timer namespace.
| operation | type | description | 
|---|---|---|
| plugin-execution | timer | Time to execute a specific plugin for one fact. | 
| fact-processing | timer | Overall time to process one fact. This includes execution of every plugin, parsing JSON payload and building the final representation model. | 
Additionally, all methods of org.factcast.server.ui.adapter.FactRepositoryImpl are timed and can be visualized via the class and method dimensions.
2 - Telemetry
Starting from factcast version 0.7.9, you can extend your server implementation to listen for internal telemetry events. This can be useful for monitoring and debugging purposes.
The telemetry events are emitted using a dedicated internal Guava EventBus.
Subscription lifecycle events
Currently, the factcast-store module emits an event on each phase of the subscription lifecycle (see
org.factcast.store.internal.telemetry.PgStoreTelemetry):
- PgStoreTelemetry.Connect: emitted whenever a client connects to the factcast server
- PgStoreTelemetry.Catchup: emitted whenever the subscription catches up to the current state of the store
- PgStoreTelemetry.Follow: emitted whenever the subscription starts consuming live events
- PgStoreTelemetry.Close: emitted whenever the client disconnects from the factcast server
- PgStoreTelemetry.Complete: emitted whenever the subscription completes its lifecycle
Each emitted event contains a request, which holds the client’s request details.
How to listen to telemetry events
It boils down to implementing a listener that is able to consume telemetry events, through
com.google.common.eventbus.Subscribe annotated methods, and registering it via the PgStoreTelemetry bean.
Here is an example:
import com.google.common.eventbus.Subscribe;
import lombok.extern.slf4j.Slf4j;
import org.factcast.store.internal.telemetry.PgStoreTelemetry;
@Slf4j
public class MyTelemetryListener {
  // Registers this listener so that its @Subscribe methods receive telemetry events.
  // This is the only constructor, so the listener cannot be created unregistered.
  public MyTelemetryListener(PgStoreTelemetry telemetry) {
    telemetry.register(this);
  }
  // Called whenever a client connects to the FactCast server.
  @Subscribe
  public void on(PgStoreTelemetry.Connect signal) {
    log.info("FactStreamTelemetry Connect: {}", signal.request());
  }
  // Called whenever the client disconnects from the FactCast server.
  @Subscribe
  public void on(PgStoreTelemetry.Close signal) {
    log.info("FactStreamTelemetry Close: {}", signal.request());
  }
}
You can check out the full example in the factcast-example-server-telemetry
module. That module contains a simple example of how to listen to each subscription lifecycle event in order to log the request
details and maintain a list of following subscriptions, which can be read through the actuator /info endpoint.
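The bookkeeping of following subscriptions could look roughly like the sketch below. It is deliberately library-free so that it runs on its own and is not the code of the example module: `FollowingSubscriptions` and its methods are hypothetical. In a real listener, `onFollow` and `onClose` would be `@Subscribe` methods that receive `PgStoreTelemetry.Follow` and `PgStoreTelemetry.Close` and pass on `signal.request()`.

```java
import java.util.List;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.stream.Collectors;

/** Hypothetical registry of subscriptions that are currently in the follow phase. */
final class FollowingSubscriptions {
  // Events may be delivered on different threads, hence a concurrent set.
  private final Set<Object> following = ConcurrentHashMap.newKeySet();

  /** Remember a request once its subscription starts consuming live events. */
  void onFollow(Object request) {
    following.add(request);
  }

  /** Forget a request when its client disconnects. */
  void onClose(Object request) {
    following.remove(request);
  }

  /** Sorted string view of the current state, e.g. for exposing it via /info. */
  List<String> snapshot() {
    return following.stream()
        .map(String::valueOf)
        .sorted()
        .collect(Collectors.toList());
  }
}
```

Removing on close (rather than on complete) is a choice made for this sketch. Which event should end the bookkeeping depends on what you want the list to represent.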
3 - PostgreSQL
General Tips
Optimize GIN index updates
While GIN indexes make querying jsonb faster, they are also expensive to update, especially because a single change can
cause the update of multiple index entries. In order to keep the overhead on write and update statements low, postgres
enables the fastupdate setting by default, which defers the update of the index and instead gathers changes to apply
them all at once. This update happens:
- when the gin_pending_list_limit is reached (default 4MB)
- when the gin_clean_pending_list function is called
- at the end of the autovacuum of the table
This can cause the query whose change eventually fills the pending list up to gin_pending_list_limit to be a lot slower than usual. If
this kind of behavior is observed, consider one of the following:
- reduce the gin_pending_list_limit -> more frequent, smaller flushes
- increase the limit and do manual flushes outside of workload
- turn off fastupdate
- let autovacuum run more often or manually call the clean operation
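The options above map to a handful of PostgreSQL statements. The following sketch only renders them as strings so that it can run without a database. `GinMaintenanceSql` is a hypothetical helper and `idx_fact_header` is a placeholder index name, not necessarily one that exists in the FactCast schema. Executing the statements, for example through a JDBC `Statement`, is left out.

```java
import java.util.regex.Pattern;

/** Hypothetical helper that renders GIN maintenance statements for a given index. */
final class GinMaintenanceSql {
  // Only plain identifiers are accepted, since the name is concatenated into SQL.
  private static final Pattern IDENTIFIER = Pattern.compile("[A-Za-z_][A-Za-z0-9_]*");

  private static String checked(String indexName) {
    if (!IDENTIFIER.matcher(indexName).matches()) {
      throw new IllegalArgumentException("not a plain identifier: " + indexName);
    }
    return indexName;
  }

  /** Per-index pending list limit in kB: smaller values mean more frequent, smaller flushes. */
  static String setPendingListLimit(String indexName, int limitKb) {
    return "ALTER INDEX " + checked(indexName) + " SET (gin_pending_list_limit = " + limitKb + ")";
  }

  /** Manual flush of the pending list, e.g. scheduled outside of peak workload. */
  static String flushPendingList(String indexName) {
    return "SELECT gin_clean_pending_list('" + checked(indexName) + "'::regclass)";
  }

  /** Turns off deferred index updates for this index. */
  static String disableFastUpdate(String indexName) {
    return "ALTER INDEX " + checked(indexName) + " SET (fastupdate = off)";
  }
}
```

Turning fastupdate off does not by itself move entries that are already in the pending list, so a flush is typically run afterwards.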
Autoanalyse & Autovacuum settings
Postgres has a built-in mechanism, called Automatic Vacuuming, that keeps the statistics up to date and cleans up dead tuples.
In many cases it is worth adjusting the default autovacuum settings to better fit the workload and ensure
a more efficient execution of the process:
# disable autovacuum schedule based on scale factor
autovacuum_vacuum_scale_factor:0
autovacuum_analyze_scale_factor:0
# set thresholds based on approximate number of facts inserted
autovacuum_vacuum_threshold:<number of new facts each month>
autovacuum_analyze_threshold:<number of new facts each week>
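PostgreSQL vacuums a table once the number of dead tuples exceeds the base threshold plus the scale factor times the number of tuples in the table. The sketch below illustrates why a scale factor of 0 makes that trigger point independent of table size. `AutovacuumTrigger` is a hypothetical helper, and the table size and threshold are made-up example numbers.

```java
/** Hypothetical helper: computes the tuple count at which autovacuum kicks in. */
final class AutovacuumTrigger {

  /** trigger point = base threshold + scale factor * number of tuples in the table */
  static long triggerPoint(long baseThreshold, double scaleFactor, long tableTuples) {
    return baseThreshold + Math.round(scaleFactor * tableTuples);
  }
}
```

With the PostgreSQL defaults (threshold 50, scale factor 0.2) and a table of 100,000,000 facts, `triggerPoint(50, 0.2, 100_000_000L)` is 20,000,050, and it keeps growing with the table. With a scale factor of 0 and a threshold of 2,000,000, the trigger point stays at 2,000,000 regardless of how large the table becomes.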
When running on AWS RDS
AWS RDS Configuration
Most of the time, the default RDS configuration of PostgreSQL is sufficient. However, in some cases, it might be necessary to adjust some settings in the RDS Parameter Groups to improve performance. The following settings are recommended for FactCast instances running on production stages:
# hands over concurrency considerations to kernel
effective_io_concurrency:0
# tune accordingly; consider roughly 100 MB (the value is given in kB) when running on a db.r5.2xlarge RDS instance
work_mem:100000
# the following might vary, depending on your non-functional requirements
log_statement:'none'
log_min_duration_statement:500
default_statistics_target:100
# allows deploying major version updates via blue/green deployments, significantly reducing downtime
rds.logical_replication:'1'
4 - General Operations Tips
Conditional execution of the date2serial migration
In FactCast v0.7.1, a new UI feature was introduced to allow filtering of events based on their publishing date. For
this purpose, the date2serial mapping table was introduced in the schema.
A Liquibase changeset takes care of creating and populating the date2serial table, but it is not executed when the
store contains more than 10 million events. This is to prevent the migration from taking too long on larger setups.
As mentioned in the changeset comments, it is suggested to run the changeset manually in such cases. The changeset can
be found in the factcast-store module under src/main/resources/db/changelog/factcast/issue2479/date2serial_for_existing_events.sql.
Optimize GIN index updates
While GIN indexes make querying jsonb faster, they are also expensive to update, especially because a single change can
cause the update of multiple index entries. To keep the overhead on write and update statements low, postgres
enables the fastupdate setting by default, which defers the update of the index and instead gathers changes to apply
them all at once. This update happens:
- when the gin_pending_list_limit is reached (default 4MB)
- when the gin_clean_pending_list function is called
- at the end of an autovacuum operation on a table
However, fastupdate on GIN indexes can have certain disadvantages:
- query performance can suffer significantly when looking through both the main index and the pending list
- when reaching the size limits, in-query cleanups can block other queries
This can cause queries to be a lot slower than usual, which we have observed in production setups. In general, if this kind of behavior is observed, consider one of the following:
- reduce the gin_pending_list_limit -> more frequent, smaller flushes
- increase the limit and do manual flushes outside of workload
- turn off fastupdate
- let autovacuum run more often or manually call the clean operation
For now, we have decided to disable the fastupdate setting via src/main/resources/db/changelog/factcast/issue3755/disable_fast_update.sql.
Please note that flushing the pending list as part of disabling the fastupdate setting could in theory block any other
query. This is why this change set is not executed automatically if the attached condition senses a larger setup
(> 10 million events). In this case please execute the change set manually to disable the fastupdate setting.