operations: Add docs on compaction
ohsayan committed Apr 24, 2024
1 parent 2f8ef59 commit fd21340
Showing 7 changed files with 29 additions and 11 deletions.
2 changes: 1 addition & 1 deletion docs/index.md
@@ -22,7 +22,7 @@ To develop using Skytable and maintain your deployment you will want to learn ab
- [**Configuration**](system/configuration): Information to help you configure Skytable with custom settings such as custom ports, hosts, TLS, and etc.
- [**User management**](system/user-management): Information on access control, user and other administration features
- [**Global management**](system/global-management): Global settings management
- [**Data recovery**](system/recovery): Database recovery
- [**Operations**](system/operations): Learn about administration operations
- **Resources**:
- [**Useful links**](resources/useful-links): Links to helpful resources
- [**Migration**](resources/migration): For old or returning Skytable users who are coming from older versions
2 changes: 1 addition & 1 deletion docs/system/1.configuration.md
@@ -44,7 +44,7 @@ To start the server with a configuration file, simply run `skyd --config <path t
Here's an explanation of all the keys:
- `system`:
- `mode`: set to either `dev` / `prod` mode. `prod` mode will generally make some things stricter (such as background services)
- `rs_window`: **This is a very important setting!** It is set to `300` by default and is called the "reliability service window" which ensures that if any changes are observed in `300` (or whatever value you set) seconds, then they reach the disk as soon as that time elapses. For example, in the default configuration the system checks for changes every 5 minutes and if there are any dataset changes, they are immediately synced. [Read more here](recovery#understanding-data-loss)
- `rs_window`: **This is a very important setting!** It is set to `300` by default and is called the "reliability service window" which ensures that if any changes are observed in `300` (or whatever value you set) seconds, then they reach the disk as soon as that time elapses. For example, in the default configuration the system checks for changes every 5 minutes and if there are any dataset changes, they are immediately synced. [Read more here](operations#understanding-data-loss)
- `auth`:
- `plugin`: this is the authentication plugin. we currently only have `pwd` that is a simple password based authentication system where the password is stored as an [`rcrypt` hash](https://github.com/ohsayan/rcrypt) on disk. More `plugin` options are set to be implemented for more advanced authentication, especially in enterprise settings
- `root_pass`: this is the root account password. **It must have at least 16 characters**
2 changes: 1 addition & 1 deletion docs/system/3.global-management.md
@@ -11,7 +11,7 @@ The following query returns an `Empty` response or an error code depending on th
SYSCTL REPORT STATUS
```

If you receive an error code, we recommend you to connect to the host and check logs. If the server has crashed, you may need to [recover the database](recovery).
If you receive an error code, we recommend you to connect to the host and check logs. If the server has crashed, you may need to [recover the database](operations#data-recovery).

## Inspecting all spaces

2 changes: 1 addition & 1 deletion docs/system/index.md
@@ -11,4 +11,4 @@ Here's an overview of the different administration guides:
- [**Configuration**](configuration): Understand how Skytable can be configured using command-line arguments, environment variables or a configuration file and what all configuration options are available
- [**User management**](user-management): Learn about account types, permissions and how you can manage multiple users
- [**Global management**](global-management): Learn how to check system health and manage the global state of your database instances
- [**Data recovery**](recovery): Understand what to do after a system crash and how to recover data if needed
- [**Operations**](operations): Understand administrator operations tasks such as backups, recovery and more
26 changes: 20 additions & 6 deletions docs/system/4.recovery.md → docs/system/operations.md
@@ -1,17 +1,30 @@
---
id: recovery
title: Recovery
title: Operations
---

## Managing disk usage

Over time, as you continue to use your database, its files will grow in size, as you would expect. However, sometimes a database file may grow beyond an efficient size, resulting in high memory usage or slowdowns. To counter this, Skytable uses internal heuristics to determine when a database file is "larger than needed" and automatically compacts it at startup.

However, in some cases you may wish to force a compaction regardless, in order to reduce the file size. To do this, run:

```sh
skyd compact
```

The server will then compact all files (even if a compaction wasn't triggered by internal heuristics) to their optimum size.
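
To see how much space a manual compaction reclaims, you can compare the size of the data directory before and after the run. The following is a minimal sketch and not part of Skytable itself: the helper name `report_size_change` and the path in the usage comment are hypothetical, and it assumes that `skyd compact` is run from the database's working directory while the server is stopped (this section does not state either requirement).

```sh
#!/bin/sh
# Hypothetical helper: print the size of a directory (in KiB) before and
# after running a command inside it.
# Usage: report_size_change <directory> <command> [args...]
report_size_change() {
    dir=$1
    shift
    before=$(du -sk "$dir" | cut -f1)
    # Run the command from inside the directory; the subshell keeps the
    # caller's working directory unchanged.
    ( cd "$dir" && "$@" ) || return 1
    after=$(du -sk "$dir" | cut -f1)
    echo "before=${before}K after=${after}K"
}

# Assumed invocation against a real deployment (placeholder path):
# report_size_change /path/to/skytable/data skyd compact
```

Because the helper takes the command as an argument, it can be tried out with any harmless command before pointing it at a real data directory.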

## Data recovery

In the unforeseen event that a power failure or other catastrophic system failure causes the database to crash, the Skytable server will fail to start normally. Usually it will exit with a nonzero code and an error message such as "journal-corrupted." In such cases, you will need to recover the journal(s) and/or any other corrupted file(s).

## Understanding data loss
### Understanding data loss

All DDL and DCL queries are immediately written to disk when they're run and hence usually no data loss will occur due to a runtime crash (unless a crash occurs in the middle of a disk write). On the other hand, DML queries are written in optimized delayed-durability batches, i.e., when the engine determines that either there are too many pending changes or too much memory is being used (alongside other factors). This, however, means that in the case of a runtime crash with pending changes, some of these changes may be lost.

This is why it is so important to tune the [`rs_window`] value or the "Reliability Service" window which ensures that irrespective of the number of changes, all changes will be flushed in that given duration. We're further working on supporting optimized immediate writes for DML queries (which however as expected would come with a significant performance penalty).
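
As an illustration of tuning the window, the sketch below writes a configuration fragment that lowers `rs_window` from the default `300` seconds to `60`. The file name `rs_window-example.yaml`, the YAML layout and the value `60` are assumptions made for this example; only the key names (`system`, `mode`, `rs_window`) and the `skyd --config` flag come from the configuration guide. It is a fragment, so a real configuration file still needs its other keys (such as `auth`).

```sh
#!/bin/sh
# Write an example fragment that shortens the reliability service window.
# Assumption: the configuration file is YAML and nests `rs_window` under
# `system`, as the key list in the configuration guide suggests.
cat > rs_window-example.yaml <<'EOF'
system:
  mode: prod
  rs_window: 60   # sync observed changes every 60 seconds (default: 300)
EOF

# Merge the fragment into your real configuration file, then start the server:
# skyd --config <path to your configuration file>
```

A shorter window narrows how much recent DML can be lost in a crash, at the cost of more frequent disk syncs.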

## Recovering database files
### Recovering database files

To repair the database, simply run this on the command line **in the working directory of the database**:

@@ -20,12 +33,13 @@ skyd repair
```
The recovery system will first create a full backup of the current data files in a subdirectory in the `backups/` directory. It will then go over each database file, try to detect any errors and make any appropriate repairs.

## Important notes
### Important notes

- The recovery system is *very conservative* and will attempt to restore the database to the most recent working state. Any remaining data is deemed unreliable and not loaded
- Please ensure that you have sufficient disk space before attempting a repair
- The earlier in the file the corruption happens, the greater the amount of data lost
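
The disk space note can be turned into a quick pre-flight check. This is a hedged sketch: the function name `has_room_for_backup` is hypothetical, and the threshold (free space at least equal to the current size of the directory) is an assumption derived from the fact that a full backup is created before the repair. The real requirement may be higher.

```sh
#!/bin/sh
# Hypothetical pre-flight check: succeed only if the filesystem holding
# <directory> has at least as much free space as the directory currently uses.
has_room_for_backup() {
    dir=$1
    used=$(du -sk "$dir" | cut -f1)
    # `df -Pk` prints POSIX-format output in 1024-byte blocks; column 4 of
    # the second line is the available space.
    free=$(df -Pk "$dir" | awk 'NR == 2 { print $4 }')
    [ "$free" -ge "$used" ]
}

# Assumed usage, run from the database's working directory:
# has_room_for_backup . && skyd repair
```

Chaining with `&&` means the repair is only attempted when the check passes.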

## Post recovery
### Post recovery

After running a repair operation, if a significant amount of data loss has occurred (as reported by `skyd`), we strongly recommend that you manually look through your datasets. The recovery process guarantees that the *restored data* is intact. If this failure resulted from power loss, you may consider installing power backup systems if self-hosting, or choosing a reliable cloud provider, in the future.

4 changes: 4 additions & 0 deletions docusaurus.config.js
@@ -161,6 +161,10 @@ module.exports = {
{
from: '/protocol/networking',
to: '/protocol/specification'
},
{
from: '/system/recovery',
to: '/system/operations#data-recovery'
}
]
}]
2 changes: 1 addition & 1 deletion sidebars.ts
@@ -27,7 +27,7 @@ module.exports = {
"system/configuration",
"system/user-management",
"system/global-management",
"system/recovery",
"system/operations",
],
link: {
type: 'doc',