In this study, we aimed to design a time- and space-efficient data structure and mechanisms to store and retrieve logs on a blockchain using the MultiChain-1.0.4 platform. The logs were stored on a small blockchain network consisting of four nodes. Each line in the data access log needed to be saved individually as one transaction, and all log data and intermediate data had to be saved on-chain; no off-chain data storage was allowed.
Blockchain in a nutshell
As described earlier, a blockchain is a collection of blocks that can be thought of as an append-only list. To ensure the immutability of the blockchain, every time a transaction is added to the system a group of miners needs to validate it. Blocks hold batches of valid transactions. A transaction moves a piece of value from one wallet address to another. A wallet address is an alphanumeric identifier for a possible destination on the chain. Unique addresses are used for transactions; however, a user can have multiple addresses. A transaction is considered valid and added to a block when every party in the network (i.e., the miners) validates that the sender has a sufficient amount of value and that the address of the sender and that of the initiator of the transaction are the same. The addition of the block to the chain generates a transaction ID using a cryptographic hash function. This ID depends on various features of the chain, such as the content of the block, the time, and previous transaction IDs; therefore, mining blocks (i.e., approval of transaction IDs) requires time and resources. To add complexity to the transaction hashes, some blockchain platforms use concepts such as proof-of-work. Proof-of-work is the process of searching for a piece of data that is extremely difficult to produce, called a "nonce" (i.e., a number only used once). The nonce becomes the part of the transaction ID that the miners search for, and thus increases the difficulty of the mining process.
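As a minimal illustration of this idea (not part of our system; the hash function and difficulty below are arbitrary assumptions), the following Python sketch searches for a nonce such that the hash of the block content combined with the nonce starts with a fixed number of zero characters:

import hashlib

def find_nonce(block_content: str, difficulty: int = 4) -> int:
    """Search for a nonce whose SHA-256 digest, computed over the block
    content plus the nonce, starts with `difficulty` hexadecimal zeros."""
    target = "0" * difficulty
    nonce = 0
    while True:
        digest = hashlib.sha256(f"{block_content}{nonce}".encode()).hexdigest()
        if digest.startswith(target):
            return nonce
        nonce += 1

# Increasing the difficulty makes the search exponentially more expensive.
print(find_nonce("previous_hash|transactions|timestamp", difficulty=4))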
Data streams in MultiChain
MultiChain data streams allow a blockchain to be used as a general purpose database. Blockchain provides time stamping, notarization, and immutability to the stored data. A MultiChain blockchain can contain any number of streams. The data published in every stream is stored by every node in the network. Each data stream on a MultiChain blockchain consists of a list of items. Each item in the stream contains the following information, as a JSON object [12]:
- A publisher (string)
- A key (between 1 and 256 ASCII characters, excluding whitespace and single/double quotes) (string)
- Data (hex string)
- A transaction ID (string)
- Blocktime (integer)
- Confirmations (integer)
When a server connects to an existing chain through the MultiChain platform, it is assigned a wallet address for that chain. MultiChain addresses differ from Bitcoin addresses in that addresses created on one MultiChain blockchain are not valid on a second chain. This prevents an operation intended for one chain from accidentally being performed on another [12]. When a data stream is created on a chain, servers with a wallet address for the chain may be granted permission to subscribe and publish to the data stream. Their wallet address for the chain is their publisher ID on the stream. To publish an item to a stream, the publisher must provide a key in the form of a string and some data in hexadecimal form as input. An example stream item is shown in Fig. 1. Another property of MultiChain data streams is that the requirement for miners can be turned off, because stream items are added to a block based on the time at which the item is published [12].
Chain creation
This step is common to both solutions we propose. Users on a primary node (server) provide a chain name and a stream name and, optionally, the address of a second node for which they have secure-shell (SSH) credentials. If no secondary server is supplied, the application calls MultiChain commands on the primary node to create a chain, initialize the chain, create a stream, and then subscribe the node to the stream. If a secondary server address is supplied, the application creates and initializes the chain, then prompts the user for their password for SSH access to the second server and requests access to connect to the chain. The application then grants connection access to the second server from the first server, accesses the second server over SSH, and connects it to the chain. On the first server, a stream is created and the first and second servers are subscribed to the stream. A flowchart of this process and the representation of the log file in the data stream are depicted in Fig. 2.
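The single-node branch of this flow can be sketched with the MultiChain 1.0 command-line tools called from Python. The chain and stream names below are placeholders, and the optional second-node/SSH branch is omitted; this is a sketch of the sequence of commands, not our exact implementation.

import subprocess
import time

CHAIN = "accesslogchain"     # placeholder chain name supplied by the user
STREAM = "logstream"         # placeholder stream name supplied by the user

def run(*cmd):
    """Run a command and return its stdout."""
    return subprocess.run(cmd, check=True, capture_output=True, text=True).stdout

run("multichain-util", "create", CHAIN)                    # create the chain parameters
run("multichaind", CHAIN, "-daemon")                       # initialize and start the chain
time.sleep(5)                                              # give the node time to come up
run("multichain-cli", CHAIN, "create", "stream", STREAM, "false")  # create a closed stream
run("multichain-cli", CHAIN, "subscribe", STREAM)          # subscribe the primary node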
Challenge solution: storing data in data streams
We stored the genomic access logs as the keys of the stream items of a block. This is possible because of the short length of each line in the log files, as a key can be at most 256 characters long. However, we note that if data larger than 256 characters needs to be stored, this solution can still be effective by using string compression techniques, at a time and memory overhead. For querying, we first used the MultiChain API call liststreamitems to retrieve all of the stream items from a blockchain. We then parsed the returned JSON object to grab the keys from the stream items. The novelty of this solution is that we stored the data in the key field of the stream item instead of the data field. Because the data field stores data in a hex format, converting the retrieved hex into plain text takes another operation and increases the query time. After parsing all of the keys in the stream, we stored the data locally in a Python Pandas data frame for further querying, after which we discarded the data frame (see Additional file 1).
Insertion MultiChain data streams allow for keys of up to 256 characters in length, excluding whitespace and single/double quotes. Each row of the genomic access logs is no more than 94 characters in length, so our solution was to first convert each tab-separated row into a single whitespace-free string using literal characters, and then publish each row to the stream as its own key, followed by the row number as the data-hex. The algorithm for inserting usage logs from text files into an existing MultiChain data stream consists of the following steps: (1) convert each row in the file to be compatible with the MultiChain key requirements and (2) publish each row as a separate transaction, as the key of a stream item on the chain.
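A minimal sketch of this insertion step is shown below, assuming a tab-separated log file named access_log.txt (a placeholder) and the same placeholder chain and stream names as above; the character sequence substituted for tabs is our assumption, not necessarily the one used in our challenge code.

import subprocess

CHAIN, STREAM = "accesslogchain", "logstream"    # placeholder names

def publish_row(row_number: int, row: str):
    # Keys may not contain whitespace, so the tab separators are replaced
    # with a literal, non-whitespace character sequence (an assumption here).
    key = row.rstrip("\n").replace("\t", r"\t")
    data_hex = str(row_number).encode().hex()    # row number as the data-hex payload
    subprocess.run(
        ["multichain-cli", CHAIN, "publish", STREAM, key, data_hex],
        check=True)

with open("access_log.txt") as log:              # hypothetical log file name
    for i, line in enumerate(log, start=1):
        publish_row(i, line)                     # one transaction per log row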
Challenge solution: querying data from data streams
The querying options in the MultiChain-1.0.4 API are limited. Stream items can be queried on transaction IDs, timestamps, or stream keys. Additionally, while the software allows querying streams by key, it cannot query partial keys or keys containing wildcards. Our solution, therefore, was to list all stream items, grab the keys, and create a data frame of all of the keys, which can then be queried. The algorithm for querying usage logs from a data stream on a MultiChain consists of the following steps, as shown in the flowchart in Fig. 3. We first download the stream items locally into memory. This creates one JSON object per stream item. After parsing the keys, we create a data frame from the stream item keys. We then perform queries on the data frame by creating a dictionary from the user query. If the user requests sorted results, the results are sorted before they are returned (see Additional file 1).
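The sketch below follows these steps under the same assumptions as the insertion sketch above; the item count passed to liststreamitems, the column index, and the value in the example query are illustrative.

import json
import subprocess
import pandas as pd

CHAIN, STREAM = "accesslogchain", "logstream"     # placeholder names

# Pull all stream items locally; "false 1000000" requests non-verbose output
# for up to 1,000,000 items (an arbitrary upper bound for this sketch).
out = subprocess.run(
    ["multichain-cli", CHAIN, "liststreamitems", STREAM, "false", "1000000"],
    check=True, capture_output=True, text=True).stdout
items = json.loads(out)

# Rebuild one row per key by undoing the whitespace substitution assumed at insertion.
rows = [item["key"].replace(r"\t", "\t").split("\t") for item in items]
frame = pd.DataFrame(rows)

# Example query: keep rows whose third column equals a given value, then sort
# on the first column before returning the result.
result = frame[frame[2] == "some_value"].sort_values(by=0)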
Alternate bigmem solution: storing data in data streams
In this solution, instead of keeping the data in the key field of the stream items, we keep the data in the data-hex field and use the timestamps as the keys.
Insertion In our insertion process, we store one main and several auxiliary records per log entry. We first store the entire entry to the stream as the main record and obtain a unique transaction ID. For each field in an entry, we insert an auxiliary record, using field:value as the key and timestamp:txid of the main record as the value (hex encoded). For example, if our record contained user 6, we would store an auxiliary record mapping user:6 to the transaction ID of the main record. We do this for all columns of the data, in this case 7. This increases the insertion time and storage, but pays off when we run queries, which then have a lower memory requirement.
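A sketch of this two-level insertion is given below, under the same placeholder chain and stream names; the field names and the hex encodings shown are illustrative assumptions rather than the exact format of our implementation.

import json
import subprocess

CHAIN, STREAM = "accesslogchain", "logstream"        # placeholder names

def cli(*args) -> str:
    """Call multichain-cli and return its raw stdout."""
    return subprocess.run(["multichain-cli", CHAIN, *args],
                          check=True, capture_output=True, text=True).stdout.strip()

def insert_entry(timestamp: str, fields: dict) -> str:
    """Store the main record, then one auxiliary record per field."""
    main_hex = json.dumps(fields).encode().hex()      # full entry as the data-hex
    txid = cli("publish", STREAM, timestamp, main_hex)  # publish returns the transaction ID
    # Auxiliary records: key "field:value", data-hex encoding "timestamp:txid".
    pointer_hex = f"{timestamp}:{txid}".encode().hex()
    for field, value in fields.items():
        cli("publish", STREAM, f"{field}:{value}", pointer_hex)
    return txid

# Example from the text: an entry whose user field is 6 gets an auxiliary
# record with key "user:6" pointing back to the main record's transaction ID.
# The timestamp and second field name are hypothetical.
insert_entry("1514764800", {"user": "6", "resource": "file_42"})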
Alternate bigmem solution: querying data from data streams
In this solution, we take advantage of the indexing we created with the transaction IDs. For each query element, we first find all matching auxiliary records and save their values as a set for that element. We then take the intersection of these sets. This gives us the timestamp:txid of every main record that matches all of the query elements, apart from the start and end time criteria (if any). We filter the resulting set by start and end time, if necessary. For each surviving element of the filtered set, we extract the transaction ID from timestamp:txid and query for the main record. The set of records returned is our query result.
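A sketch of this query procedure, continuing the assumptions of the bigmem insertion sketch above, is given below; one way to realize the lookups is with the MultiChain API calls liststreamkeyitems (auxiliary records) and getstreamitem (main records), although other call combinations are possible.

import json
import subprocess

CHAIN, STREAM = "accesslogchain", "logstream"        # placeholder names

def cli(*args) -> str:
    return subprocess.run(["multichain-cli", CHAIN, *args],
                          check=True, capture_output=True, text=True).stdout.strip()

def query(criteria: dict, start=None, end=None):
    """criteria maps field name to value, e.g. {"user": "6"}."""
    pointer_sets = []
    for field, value in criteria.items():
        items = json.loads(cli("liststreamkeyitems", STREAM,
                               f"{field}:{value}", "false", "1000000"))
        # Each auxiliary record's data decodes to "timestamp:txid".
        pointer_sets.append({bytes.fromhex(it["data"]).decode() for it in items})

    # Main records matching every criterion lie in the set intersection.
    pointers = set.intersection(*pointer_sets) if pointer_sets else set()

    results = []
    for pointer in pointers:
        timestamp, txid = pointer.split(":", 1)
        # String comparison assumes fixed-width epoch timestamps (a sketch-level
        # simplification).
        if start is not None and timestamp < start:
            continue
        if end is not None and timestamp > end:
            continue
        main = json.loads(cli("getstreamitem", STREAM, txid))
        results.append(bytes.fromhex(main["data"]).decode())
    return results

print(query({"user": "6"}, start="1514764800"))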