Integrating Vector with ClickHouse
Being able to analyze your logs in real time is critical for production applications. Have you ever wondered if ClickHouse is good at storing and analyzing log data? Just checkout Uber's experience with converting their logging infrastructure from ELK to ClickHouse.
This guide shows how to use the popular data pipeline Vector to tail an Nginx log file and send it to ClickHouse. The steps below would be similar for tailing any type of log file. We will assume you already have ClickHouse up and running and Vector installed (no need to start it yet though).
1. Create a database and table
Let's define a table to store the log events:
- 
We will start with a new database named nginxdb:
- 
For starters, we are just going to insert the entire log event as a single string. Obviously this is not a great format for performing analytics on the log data, but we will figure that part out below using materialized views. NoteThere is not really a need for a primary key yet, so that is why ORDER BY is set to tuple(). 
2. Configure Nginx
We certainly do not want to spend too much time explaining Nginx, but we also do not want to hide all the details, so in this step we will provide you with enough details to get Nginx logging configured.
- 
The following access_logproperty sends logs to/var/log/nginx/my_access.login the combined format. This value goes in thehttpsection of yournginx.conffile:
- 
Be sure to restart Nginx if you had to modify nginx.conf.
- 
Generate some log events in the access log by visiting pages on your web server. Logs in the combined format have the following format: 
3. Configure Vector
Vector collects, transforms and routes logs, metrics, and traces (referred to as sources) to lots of different vendors (referred to as sinks), including out-of-the-box compatibility with ClickHouse. Sources and sinks are defined in a configuration file named vector.toml.
- 
The following vector.toml defines a source of type file that tails the end of my_access.log, and it also defines a sink as the access_logs table defined above: 
- 
Start up Vector using the configuration above. Visit the Vector documentation for more details on defining sources and sinks. 
- 
Verify the access logs are being inserted into ClickHouse. Run the following query and you should see the access logs in your table: 
4. Parse the Logs
Having the logs in ClickHouse is great, but storing each event as a single string does not allow for much data analysis. Let's see how to parse the log events using a materialized view.
- 
A materialized view (MV, for short) is a new table based on an existing table, and when inserts are made to the existing table, the new data is also added to the materialized view. Let's see how to define a MV that contains a parsed representation of the log events in access_logs, in other words: There are various functions in ClickHouse to parse the string, but for starters let's take a look at splitByWhitespace - which parses a string by whitespace and returns each token in an array. To demonstrate, run the following command: Notice the response is pretty close to what we want! A few of the strings have some extra characters, and the user agent (the browser details) did not need to be parsed, but we will resolve that in the next step: 
- 
Similar to splitByWhitespace, the splitByRegexp function splits a string into an array based on a regular expression. Run the following command, which returns two strings. Notice the second string returned is the user agent successfully parsed from the log: 
- 
Before looking at the final CREATE MATERIALIZED VIEW command, let's view a couple more functions used to cleanup the data. For example, the RequestMethodlooks like "GET with an unwanted double-quote. Run the following trim function, which removes the double quote:
- 
The time string has a leading square bracket, and also is not in a format that ClickHouse can parse into a date. However, if we change the separator from a colon (:) to a comma (,) then the parsing works great: 
- 
We are now ready to define our materialized view. Our definition includes POPULATE, which means the existing rows in access_logs will be processed and inserted right away. Run the following SQL statement: 
- 
Now verify it worked. You should see the access logs nicely parsed into columns: NoteThe lesson above stored the data in two tables, but you could change the initial nginxdb.access_logstable to use the Null table engine - the parsed data will still end up in thenginxdb.access_logs_viewtable, but the raw data will not be stored in a table.
Summary: By using Vector, which only required a simple install and quick configuration, we can send logs from an Nginx server to a table in ClickHouse. By using a clever materialized view, we can parse those logs into columns for easier analytics.
