In this study we designed a time/space efficient data structure and algorithms to store and query pharmacogenomics within a smart contract in a private Ethereum blockchain. The data was stored on a small blockchain network with only four nodes. Each data point was inserted to our smart contract as a single transaction. All data had to be stored on-chain. No off-chain data storage was allowed.
Blockchain technology explained
Here we provide a primer on blockchain technology for those unfamiliar with the basics. For a more rigorous treatment of this material, the reader should consult the bitcoin white paper and the ethereum yellow paper [5, 8].
As outlined earlier, a blockchain is a secure data structure shared in a peer-to-peer network [Fig. 1a-b]. The chain is made up of blocks of data forward-linked by hashes. Each block in the chain contains a header and a list of valid transactions. The header includes several fields relevant to both the integrity of the data structure (e.g. the timestamp), and the parameters of the network (e.g. mining parameters). These fields include the block timestamp, block number, mining parameters, and the hash of the previous block’s contents (which links one block to the next). Users or nodes can submit transactions which modify the state of the network, for example by sending value from one user address to another, or storing a value within a smart contract (discussed further below).
The network consensus mechanism determines which user in the network will append the transactions to the chain as a new block. This is a critical process; if it were to fail, the validity of the added block would be compromised. The most prominent consensus mechanisms include proof of work, proof of stake, and proof of authority. In a proof of work (PoW) network, nodes "mine" blocks; that is, they exert computational energy to identify values which, when added to the incoming block‘s contents, yields a block hash below that of a set network parameter referred to as the difficulty. The difficulty is a network-wide parameter which can be varied in order to regulate the rate of block formation. The proof of work mechanism is critical to public cryptocurrency blockchain networks such as Bitcoin because it sets a cost for modifying the blockchain, thereby deterring bad actors from corrupting the chain [5]. Many have criticized the PoW mechanism for being unsustainable in the long run because of the large amounts of energy required to perform computational mining [12]. To address this criticism, the proof of stake mechanism has been proposed [13]. With proof of stake, a network algorithm determines which node will add the block to the chain based on the node’s stake, a combination of parameters including their account balance [14]. The PoS mechanism does not require heavy computation, thereby reducing the energy usage in the network. These consensus mechanisms are essential for a public blockchain network where anyone in the world can run a node and potentially modify the chain. However, in a private, permissioned network, these mechanisms can be replaced with a simple proof of authority mechanism [5]. PoA is a modified version of PoS with identity as the only ’stake’ [14]. We tested our smart contracts on a private test network using proof of authority.
Ethereum blockchain and smart contracts
A broad view of Ethereum might divide the world into three parts: blockchain, network trie roots, and trie data structures [Fig. 1b]. The blockchain logs the network’s state at specific times, after transactions have altered the state. The state of the network is stored in merkle patricia trees each of which possess a top hash. Blocks in the chain store these top hash values, but do not store all the data in the blockchain network (it would be far too large). The state data is stored in a database layer using leveldb [15, 16]. User account data and smart contract data (including the code and the actual data inserted via the code) are stored in these trie data structures, which are synced by "full" and "archival" nodes only (nodes which require significantly more computational power and storage) [16]. These nodes are integral to the health of the network. However, the Ethereum protocol also contains a "light" node option, in which only the block headers are synced [9].
Ethereum can handle a wide range of transactions via smart contracts, self-executable turing-complete programs which run in the Ethereum Virtual Machine (EVM) and maintain state in their own storage [8]. The EVM has a stack-based architecture, and can either store things on the stack (e.g. bytecode operations), in memory (e.g. temporary variables within functions), or in storage (e.g. permanent variables holding database entries). Each smart contract can read and write to its own storage only. In order to discourage developers from writing inefficient or unwieldy smart contracts, there is a ’gas’ cost associated with each storage and retrieval command. Just as blockchain users have an address, a smart contract’s state resides at a particular unique address in the global state of the Ethereum network, which users can call. If a user does not have enough gas, the contract call and corresponding transaction cannot be completed. Smart contracts provide an opportunity to develop applications with complex functionality in a blockchain network. We leveraged the flexibility of smart contract programming to create a challenge solution and alternate fastQuery solution that insert pharmacogenomics data in a custom way in contract storage to maximize storage and query efficiency.
Network configuration
We developed and tested our solutions in Truffle v5.1.18, a command line tool for Ethereum, which provided us with a built-in JavaScript test environment. Within this network, we tested insertion and querying with up to 10,000 entries in the database. Our setup constituted a “development network”, which was separate from the public Ethereum network. This private configuration allowed us to develop and test our contracts extensively, without having to deploy a new contract each time. While network configuration can drastically affect performance in a Proof-of-Work scheme, it has minimal effects in a Proof-of-Authority scheme. Schaffer et al. demonstrate that increasing the number of nodes in a private, Proof-of-Authority Ethereum network does not significantly affect performance [17].
Chain initialization
Chain initialization in Truffle is automated, and only requires setting basic parameters such as gas limit and network name in a config file. We used the default gas limit and price values for the development network (4712388 and 100 gwei, respectively). Together, gas limit and price values determine the maximum amount of Ether one can spend on transaction costs (one can spend no more than gas price * gas limit). In a proof of work scheme, the gas price also affects transaction speed, as miners will mine transactions with a more profitable gas price [18]. See [8] for a detailed explanation of the gas price and gas limit parameters.
Challenge solution: an index-based, multi-mapping smart contract
Our challenge solution utilizes four storage mappings, linked to one another by a unique integer ID assigned to each inserted observation in the database [Fig. 1c]. Mappings are similar to hash tables, and allow efficient key-value lookup. In the first of the four mappings, which we call the database, we store the pharmacogenomics data and assign to each entry a unique ID to serve as the mapping key. We store each observation in its own struct, a composite data type which can hold multiple fields. In the three other mappings, we use the gene names, variant numbers, and drug names as keys to an array of relevant IDs. Thus, given a gene name key, one can retrieve a list of IDs that can key into the database mapping and return an observation matching that particular gene name. We chose this implementation in order to reduce the number of loops required to check the data in the database and return matches, and thereby achieve time/memory efficient querying.
Challenge solution: insertion
Data can be inserted into a smart contract via one-line commands in a JavaScript console or from an external script. We wrote an insertion function within the contract to insert a single observation for a given gene name-variant number-drug name combination. In our case the observation consists of the gene name, variant number, drug name, outcome (improved, unchanged, or deteriorated), suspected gene-outcome relation (true/false), and serious side effect (true/false). Upon passing in the observation, the function executes the following steps, (1) Convert each field of the observation to the desired data type for storage (e.g. it is more efficient to store strings as bytes32 types in storage); (2) if the gene name, variant, drug combination does not already exist in the database, add it to an array holding only the unique gene name, variant, and drug name combinations; (3) Use the gene name, variant number, and drug name to key into their respective mapping and append to the value array the counter as the ID, where counter is a globally updated variable; (4) push to the database mapping a struct with key-value pair: counter-[struct holding the observation]; (5) increment counter variable by one (see Algorithm 1).
We store the gene name, variant number, and outcome fields as bytes32 variables, a fixed-size bytes array that uses less ’gas’ in Ethereum relative to strings and is compatible with basic utilities in Solidity (for example, it is simpler to check the length of a bytes variable than a string variable using Solidity). We store the drug name as a string because many drug names are lengthy and can exceed the 32-character limit of the bytes32 data type. Booleans are straightforward in Solidity, and addresses are types to conveniently store the user or contract addresses on the blockchain.

Challenge solution: query algorithm
Our challenge solution can handle both three-field queries, where the fields are gene name, variant number, drug name, and wildcard queries. One can query by any combination of these three fields, or simply specify gene="*", variant= "*", drug= "*", which should return the entire contents of the database. To query, we first check how many fields have been queried, and for those that were, we use the query fields to key into the appropriate mapping and return the value associated with that key– an array of integer IDs corresponding to the ID of the data in the database mapping. If all fields were "*" we use all IDs currently in use to return the data from the database mapping. Otherwise, we determine which of the ID-arrays has the minimum length, and loop through it (since its contents will limit the results of the set intersection). For each entry in this minimum-length ID array, for every other field that was searched we loop through that ID array and check whether it matches the ID in the outer loop. At the end of the outer loop, if the number of matches is equal to the number of fields searched, then this integer is used to key into the database mapping and grab the struct value, which is then saved to a memory array. We then loop through the array of unique combinations and for each one pool the structs in the search results that match that unique combination. This allows us to output data in a useful way: for a given gene name, variant number, and drug name, how many observations are there, what number and percentage of cases saw serious side effects, what number and percentage suspected an interaction between the gene and drug, and what number and percentage saw an improved, deteriorated or unchanged outcome when administered the drug (see Algorithm 2).

fastQuery solution: a pooled, index-based, multi-mapping smart contract
Rather than storing each observation in its own struct, our alternate, fastQuery solution stores a single struct for each unique gene-variant-drug relation. We utilize four storage mappings linked to one another by an ID assigned to each inserted observation in the database, and an additional fifth mapping for linking IDs to their unique gene-variant-drug relation. This alternate design reduces the number of IDed entries in the database, and prevents indefinite database growth (there is an infinite number of raw observations that can be inserted into our challenge solution, but a finite number of unique gene-variant-drug relations that exist). We chose this implementation in order to reduce the length of loops required to check the data in the database and return matches, and thereby further improve the query time.
fastQuery solution: insertion
We wrote a new insertion function to insert a single observation for a given gene name-variant number-drug name combination. Upon passing in the observation, the function executes the following steps, (1) Convert each field of the observation to the desired data type for storage; (2) Use the input gene, variant, and drug as a combination key into the fifth mapping, which returns the ID for that unique relation; (3) Use the ID as a key into the database mapping and check whether this relation already exists in the database; (4) if not, fill in the name information from the input data, add the gene, variant, and drug names as keys in their respective mappings with the ID as the value, and increment the global ID counter by 1; (5) increment the total count, improved count, deteriorated count, suspected relation count, and side effect count fields of the relation struct in the database based on the inserted data. We store the variables as the same data types used in our challenge solution (see Algorithm 3).

fastQuery solution: query algorithm
Our fastQuery solution can handle both unique and wildcard queries, like the challenge solution. Yet, with fastQuery, we do not need to iterate through an array of unique gene-variant-drug combinations because the data are already stored pooled in these unique relations. Thus, we simply obtain the matches using the challenge solution, convert the count data to percentages, and output the search result (see Algorithm 4 and Additional file 1).
