Parsing Blockchain Data and Building the Analytics Dataset

Parsing Blockchain Data

Learning Objectives

  • Comparing in-place and external data analysis
  • Relating external data
  • Deciding what data you need
  • Assembling the data your models will need

Comparing On-Chain and External Analysis Options

1. Real-Time Parsing

In real-time parsing, data is read directly from the blockchain as it is generated. Here’s how it works:

Data Retrieval from Blockchain:

  • The process begins by fetching data directly from the blockchain. This data could include transactions, smart contract events, token transfers, or any other relevant information.
  • For example, consider a decentralized application (DApp) that tracks real estate transactions on a blockchain. As new property sales occur, the DApp reads the transaction data in real time.

Fetching Additional Contextual Data:

  • Real-time parsing doesn’t stop at blockchain data alone. It often requires additional context to make sense of the raw information.
  • For instance, the DApp might need to fetch property details (such as location, price, and ownership history) from an external database or API. This supplementary data enriches the blockchain data.

Complete Dataset for Analysis:

  • Once all relevant data (both from the blockchain and external sources) is collected, it forms a complete dataset.
  • This dataset is then handed over to the analytics code for further processing. The code can perform various tasks, such as fraud detection, trend analysis, or predictive modeling.
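The three stages above can be sketched as a small streaming pipeline. This is a minimal illustration under stated assumptions, not a production design: the event list stands in for a live blockchain subscription, PROPERTY_DETAILS stands in for an external database or API, and every name and value is hypothetical.

```python
from typing import Callable, Iterable, Iterator

# Stand-in for events read from the chain as they are generated.
# A real DApp would subscribe to contract events through a node client.
SAMPLE_SALE_EVENTS = [
    {"tx_hash": "0xaaa", "property_id": "P-1", "price_wei": 5_000},
    {"tx_hash": "0xbbb", "property_id": "P-2", "price_wei": 7_500},
]

# Stand-in for an off-chain database or API keyed by property ID.
PROPERTY_DETAILS = {
    "P-1": {"location": "Springfield", "prior_owners": 2},
    "P-2": {"location": "Shelbyville", "prior_owners": 0},
}


def lookup_property(property_id: str) -> dict:
    """Return external context, or an empty dict if nothing is on record."""
    return PROPERTY_DETAILS.get(property_id, {})


def enrich(events: Iterable[dict], lookup: Callable[[str], dict]) -> Iterator[dict]:
    """Attach external context to each on-chain event as it arrives."""
    for event in events:
        yield {**event, **lookup(event["property_id"])}


def analyze(record: dict) -> dict:
    """Toy analytics step: flag sales of properties with no ownership history."""
    return {**record, "first_sale": record.get("prior_owners") == 0}


# Each event is enriched and analyzed immediately; nothing is stored in between.
results = [analyze(record) for record in enrich(SAMPLE_SALE_EVENTS, lookup_property)]
```

Because enrich is a generator, each record flows through to the analytics step as soon as it is read, which is the defining trait of real-time parsing.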

  • Example: Suppose we have a supply chain management system built on a blockchain. The system tracks the movement of goods from manufacturer to retailer. Real-time parsing involves reading shipment data from the blockchain and fetching additional details like weather conditions during transit. The combined dataset enables the system to optimize delivery routes based on historical weather patterns.

2. Two-Step Analysis

The two-step analysis approach involves an intermediate storage step before performing analytics. Here’s how it works:

Blockchain Data Retrieval and Storage:

  • Initially, data is fetched from the blockchain (similar to real-time parsing).
  • However, instead of immediately analyzing this data, it is stored in an off-chain repository (such as a database or cloud storage).

Traditional Analytics Methods:

  • In the second step, traditional analytics methods are applied to the stored data.
  • These methods can include SQL queries, statistical analysis, machine learning algorithms, or custom scripts.
  • The goal is to extract meaningful insights from the combined blockchain and off-chain data.

Input Requirements Completion:

  • By combining blockchain data with other relevant information (which may not be available on-chain), the two-step analysis ensures that the input requirements for analytics models are met.
  • For example, a bank might use this approach to analyze customer transaction patterns. The blockchain data (transactions) is combined with customer profiles (stored off-chain) to build predictive models for credit risk assessment.
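The two steps can be sketched with Python's built-in sqlite3 module. The tables, column names, and sample rows below are assumptions made for illustration; the transaction rows stand in for data already read from the chain, and an in-memory database stands in for the off-chain repository.

```python
import sqlite3

# Step 1: persist fetched blockchain data in an off-chain store.
# These rows stand in for transactions already read from the chain.
fetched_transactions = [
    ("0x01", "cust-1", 120.0),
    ("0x02", "cust-1", 80.0),
    ("0x03", "cust-2", 500.0),
]
# Customer profiles exist only off-chain.
customer_profiles = [
    ("cust-1", "retail"),
    ("cust-2", "business"),
]

conn = sqlite3.connect(":memory:")  # in-memory database, for illustration only
conn.execute(
    "CREATE TABLE transactions (tx_hash TEXT PRIMARY KEY, customer_id TEXT, amount REAL)"
)
conn.execute("CREATE TABLE customers (customer_id TEXT PRIMARY KEY, segment TEXT)")
conn.executemany("INSERT INTO transactions VALUES (?, ?, ?)", fetched_transactions)
conn.executemany("INSERT INTO customers VALUES (?, ?)", customer_profiles)
conn.commit()

# Step 2: run conventional analytics (here, plain SQL) on the stored data.
query = """
    SELECT c.segment, COUNT(*) AS tx_count, SUM(t.amount) AS total_amount
    FROM transactions AS t
    JOIN customers AS c ON c.customer_id = t.customer_id
    GROUP BY c.segment
    ORDER BY c.segment
"""
summary = conn.execute(query).fetchall()
conn.close()
```

The join is where the off-chain profile data completes the model's input requirements: the chain supplies the transactions, and the database supplies the customer segment.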

Example: Consider a healthcare consortium that stores patient medical records on a private blockchain. The two-step analysis involves:

  • Storing patient records (encrypted for privacy) in an off-chain database.
  • Applying machine learning algorithms to predict disease outbreaks based on historical patient data, demographics, and environmental factors.

In summary, both real-time parsing and two-step analysis play crucial roles in extracting valuable insights from blockchain data. The choice between them depends on the specific use case, data availability, and analytical requirements.

Comparing One-Off Versus Repeated Analysis

One-Off Analysis

When you perform a one-off analysis, you’re addressing a specific question or task that requires analyzing data just once. Here are the key points:


  • You have a single, well-defined purpose for analyzing the data.
  • The analysis is not expected to be repeated frequently.

Data Processing Flow:

  • Fetch data directly from the blockchain (real-time parsing or other methods).
  • Analyze the data immediately without storing it elsewhere.
  • Results are delivered for the specific use case.


Use Case: A company wants to assess the impact of a recent marketing campaign on token sales.

Process: Retrieve transaction data from the blockchain, calculate token sales during the campaign period, and provide insights to the marketing team.
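A one-off analysis like this one can be a few lines that read, compute, and report, with no storage step. The sketch below assumes the token sale transactions have already been read into a list; the campaign dates, field names, and amounts are invented for illustration.

```python
from datetime import datetime, timezone

# Stand-in for token sale transactions read from the chain.
# Timestamps are Unix seconds, the form Ethereum block timestamps take.
token_sales = [
    {"timestamp": 1704067200, "tokens": 100},  # 2024-01-01 00:00 UTC
    {"timestamp": 1705276800, "tokens": 250},  # 2024-01-15 00:00 UTC
    {"timestamp": 1706745600, "tokens": 400},  # 2024-02-01 00:00 UTC
]

# Hypothetical campaign window.
campaign_start = datetime(2024, 1, 10, tzinfo=timezone.utc).timestamp()
campaign_end = datetime(2024, 1, 31, tzinfo=timezone.utc).timestamp()


def tokens_sold_between(sales, start, end):
    """Sum tokens sold in the half-open interval [start, end)."""
    return sum(s["tokens"] for s in sales if start <= s["timestamp"] < end)


campaign_total = tokens_sold_between(token_sales, campaign_start, campaign_end)
baseline_total = tokens_sold_between(token_sales, 0, campaign_start)
```

Nothing is persisted here, so a follow-up question would require reading the chain again.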

Repeated Analysis

Repeated analysis involves scenarios where data processing will occur multiple times. Consider the following aspects:


  • The model will be run more than once (even if initially intended as a one-time query).
  • Expect follow-up requests or periodic updates.

Data Processing Flow:

  • Fetch data from the blockchain (similar to one-off analysis).
  • Store the data in an off-chain repository (database, cloud storage, etc.).
  • Perform analytics on the stored data during subsequent runs.


Use Case: An insurance company wants to assess claim patterns over time.


Initial Pass: Retrieve claims data from the blockchain and store it in an external database.

Subsequent Runs: Apply statistical models periodically to identify trends, fraud patterns, or risk factors.
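The initial pass and the subsequent runs can be sketched as an incremental sync followed by a query on stored data. Everything here is an assumption made for illustration: CHAIN is a dictionary standing in for the blockchain, fetch_claims is a hypothetical reader, and SQLite stands in for the external database.

```python
import sqlite3

# Stand-in for the chain: claims keyed by the block that recorded them.
CHAIN = {
    1: [("claim-1", 300.0)],
    2: [("claim-2", 1200.0), ("claim-3", 50.0)],
    3: [("claim-4", 700.0)],
}


def fetch_claims(from_block, to_block):
    """Hypothetical reader; a real one would query a node for contract events."""
    for block in range(from_block, to_block + 1):
        for claim_id, amount in CHAIN.get(block, []):
            yield block, claim_id, amount


def sync(conn, latest_block):
    """Store only the blocks that are not yet in the off-chain database."""
    last = conn.execute("SELECT COALESCE(MAX(block), 0) FROM claims").fetchone()[0]
    rows = list(fetch_claims(last + 1, latest_block))
    conn.executemany("INSERT INTO claims VALUES (?, ?, ?)", rows)
    conn.commit()
    return len(rows)


def average_claim(conn):
    """The repeated analytics step runs on stored data, not on the chain."""
    return conn.execute("SELECT AVG(amount) FROM claims").fetchone()[0]


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE claims (block INTEGER, claim_id TEXT PRIMARY KEY, amount REAL)")

first_pass = sync(conn, latest_block=2)   # initial pass stores blocks 1 and 2
first_avg = average_claim(conn)
second_pass = sync(conn, latest_block=3)  # a later run fetches only block 3
second_avg = average_claim(conn)
```

Tracking the last stored block means the expensive read of old blocks happens once, and each later run pays only for new data.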


  • Sunk Cost: The first pass through blockchain data is a sunk cost. Regardless of subsequent actions, you must invest effort in collecting data initially.
  • Volatility: If data changes frequently (e.g., stock prices, weather conditions), fresh data is crucial for accurate analysis.
  • Trade-offs: Balancing performance, reusability, and meeting project deadlines is essential.

Remember that even seemingly one-time queries often lead to follow-up requests. By designing for reusability and efficiency, you build effective analytical models that adapt to changing data needs.

Assessing Data Completeness

In any data analytics project, understanding the available data is crucial. When working with blockchain data, especially from smart contracts, consider the following steps:

List Available Blockchain Data:

  • Start by identifying the data directly accessible from the blockchain.
  • Look at the state variables exposed by smart contracts. These variables hold important information.
  • You can find details about these variables in smart contract documentation or by examining the source code.

Beyond the Obvious:

  • Don’t stop at the obvious data points. Explore further to uncover additional relevant data.
  • Sometimes, the most valuable insights come from less apparent sources.

ABI and Smart Contract Source Code:

  • The ABI (Application Binary Interface) provides information about available data in a machine-readable format (often JSON).
  • Alternatively, inspect the smart contract source code. It not only reveals what data exists but also how the smart contract utilizes that data.
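Because the ABI is JSON, listing what a contract exposes takes only a few lines. The ABI fragment below is hand-written for a hypothetical supply chain contract; the function and event names are assumptions, while the entry layout (type, name, inputs, outputs, stateMutability) follows the standard contract ABI JSON format.

```python
import json

# A hand-written ABI fragment for a hypothetical supply chain contract.
ABI_JSON = """
[
  {"type": "function", "name": "getParticipant", "stateMutability": "view",
   "inputs": [{"name": "id", "type": "uint32"}],
   "outputs": [{"name": "uuid", "type": "string"}]},
  {"type": "function", "name": "transferOwnership", "stateMutability": "nonpayable",
   "inputs": [{"name": "productId", "type": "uint32"},
              {"name": "newOwner", "type": "address"}],
   "outputs": []},
  {"type": "event", "name": "OwnershipTransferred", "anonymous": false,
   "inputs": [{"name": "productId", "type": "uint32", "indexed": true},
              {"name": "newOwner", "type": "address", "indexed": false}]}
]
"""

abi = json.loads(ABI_JSON)


def signature(entry):
    """Render an ABI entry as name(type1,type2,...)."""
    arg_types = ",".join(arg["type"] for arg in entry.get("inputs", []))
    return f"{entry['name']}({arg_types})"


# Read-only functions expose contract state without sending a transaction.
readable = [
    signature(e) for e in abi
    if e["type"] == "function" and e.get("stateMutability") in ("view", "pure")
]

# Events are the log records a contract emits when something happens.
events = [signature(e) for e in abi if e["type"] == "event"]
```

The readable list is a first inventory of queryable state, and the events list shows which historical records can be scanned from the transaction logs.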


Suppose you’re building an analytics model for supply chain management using blockchain data. The smart contract tracks product ownership and shipment details. Here’s how you assess data completeness:

Initial Data Inventory:

  • Retrieve data from the blockchain, including product ownership changes and shipment events.
  • These transactions and state variables form the core dataset.

Additional Data Needs:

  • To analyze shipping efficiency, you require pickup and drop-off location information.
  • However, the smart contract doesn’t store physical addresses.
  • You’ll need to fetch address data from external sources (e.g., a geolocation API or an off-chain database).

By combining blockchain data with external address information, you’ll have a complete dataset for analyzing shipping efficiency and optimizing supply chain logistics.
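A completeness check of this kind reduces to comparing the fields a model needs against the fields each source can supply. The field and source names below are hypothetical and chosen to mirror the shipping example.

```python
# Fields the shipping-efficiency model needs (hypothetical names).
required_fields = {
    "product_id", "owner", "shipped_at", "delivered_at",
    "pickup_location", "dropoff_location",
}

# Fields available directly from the smart contract.
on_chain_fields = {"product_id", "owner", "shipped_at", "delivered_at"}

# Fields each external source can supply.
external_sources = {
    "address_db": {"pickup_location", "dropoff_location"},
}

# Anything required but not on-chain must come from somewhere else.
missing_on_chain = required_fields - on_chain_fields

# Map each missing field to the external sources that can supply it.
coverage = {
    field: sorted(name for name, fields in external_sources.items() if field in fields)
    for field in sorted(missing_on_chain)
}

# Fields that no known source covers; the dataset is incomplete until this is empty.
unresolved = [field for field, sources in coverage.items() if not sources]
```

Running the check before building the dataset makes the gap explicit: here both location fields must come from the address database, and nothing is left unresolved.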

Integrating External Data in Blockchain Apps

  1. Blockchain Data Limitations:
    1. Blockchain technology prioritizes data transparency and integrity over bulk storage or performance.
    2. Storing excessive data directly on the blockchain can increase costs.
    3. Blockchain excels when handling information directly related to value transfers (e.g., ownership changes).
  2. Transaction Data Minimalism:
    1. Ownership transfer transactions require only essential data for recording the transaction.
    2. Supporting data (such as demographics) may exist but doesn’t need to be stored on-chain.
    3. The goal is to keep blockchain blocks concise and efficient.
  3. GDPR and the “Right to Be Forgotten”:
    1. GDPR gives individuals the right to have their personal data erased on request.
    2. To address this, blockchain designs limit on-chain storage of personally identifiable information (PII).
    3. Instead of storing PII directly, blockchain blocks often contain hashes that act as pointers to off-chain records.
  4. Orphaned Blockchain Pointers:
    1. Removing PII from off-chain data satisfies the “right to be forgotten.”
    2. Although the blockchain data (pointer) remains, it no longer points to valid off-chain data.
    3. This creates an orphan condition where the blockchain data lacks a corresponding off-chain record.
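The pointer and orphan mechanics described in items 3 and 4 can be sketched with a hash over each off-chain record. This is an illustration under stated assumptions: a dictionary stands in for the off-chain store, a list stands in for pointers already written to the chain, and SHA-256 over canonical JSON is one possible hashing scheme rather than a prescribed one.

```python
import hashlib
import json


def record_hash(record: dict) -> str:
    """SHA-256 over a canonical JSON encoding of the off-chain record."""
    canonical = json.dumps(record, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Off-chain store keyed by record hash (stand-in for a database).
off_chain = {}
# Stand-in for pointers already written to the chain; these cannot be removed.
on_chain_pointers = []

for person in ({"name": "Alice", "city": "Lisbon"}, {"name": "Bob", "city": "Oslo"}):
    digest = record_hash(person)
    off_chain[digest] = person
    on_chain_pointers.append(digest)

# An erasure request removes the off-chain record only.
erased = on_chain_pointers[0]
del off_chain[erased]

# Pointers with no matching off-chain record are orphans.
orphans = [pointer for pointer in on_chain_pointers if pointer not in off_chain]
```

Analytics code that joins on these pointers should expect orphans and decide explicitly whether to skip or count them.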


  • Consider a supply chain management DApp using blockchain.
  • The SupplyChain.sol smart contract defines a participant UUID (uniquely identifying participants).
  • When retrieving a participant’s data via the getParticipant() function, you get the UUID.
  • To complete the story, you fetch additional details (e.g., name, address) from an external database using SQL queries.
  • The blockchain data (UUID) acts as a key to link on-chain and off-chain information.
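The lookup in the last two bullets might look like the following sketch. The get_participant stub stands in for the contract call, and its integer argument, the table layout, and the sample UUID and values are all assumptions; only the idea of using the on-chain UUID as the SQL key comes from the example above.

```python
import sqlite3

SAMPLE_UUID = "3f2b8c1e-0000-4000-8000-000000000001"  # made-up value


def get_participant(participant_id: int) -> str:
    """Stand-in for the contract call; returns the participant's UUID."""
    return {1: SAMPLE_UUID}[participant_id]


# Stand-in for the external database that holds participant details.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE participants (uuid TEXT PRIMARY KEY, name TEXT, address TEXT)")
conn.execute(
    "INSERT INTO participants VALUES (?, ?, ?)",
    (SAMPLE_UUID, "Acme Freight", "1 Dock Road"),
)

uuid = get_participant(1)

# Parameterized query: the UUID read from the chain is the lookup key.
row = conn.execute(
    "SELECT name, address FROM participants WHERE uuid = ?", (uuid,)
).fetchone()
conn.close()

# row is None when the off-chain record is missing (for example, after erasure).
participant = {"uuid": uuid, "name": row[0], "address": row[1]} if row else None
```

Passing the UUID as a query parameter rather than formatting it into the SQL string keeps the lookup safe even if the on-chain value is unexpected.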

In summary, blockchain data often serves as a starting point, and related external data completes the picture. By judiciously managing what goes on-chain and leveraging external resources, effective blockchain applications can balance transparency, privacy, and efficiency.

Building an Analysis Dataset

Example Scenario: Tracking Participants in a Blockchain Project

Suppose we are working with a blockchain project where various participants (such as investors, developers, and users) interact within a decentralized application (DApp) on the Ethereum blockchain. Our goal is to analyze the engagement and activity level of these participants.

Step-by-Step Process:

1. Setting up Connections

First, we need to establish connections to the Ethereum blockchain and the CSV files:

  • Ethereum Blockchain: We connect to a local Ethereum blockchain instance (e.g., Ganache) for development and testing purposes.
  • CSV Files:
    • participantDetails.csv: Contains detailed information about each participant, such as name, email, and role.
    • dataSet.csv: This is the file where we will write our combined data.

Here’s the Python setup for these connections:

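from web3 import Web3

# Connect to the local Ethereum blockchain.
# Set ganache_url to the RPC endpoint of your local Ganache instance.
ganache_url = ""
web3 = Web3(Web3.HTTPProvider(ganache_url))

# Open the CSV files for reading and writing.
# newline='' is the setting the csv module expects for files it reads or writes.
fileHandleIN = open('participantDetails.csv', 'r', newline='')
fileHandleOUT = open('dataSet.csv', 'w', newline='')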

2. Reading and Combining Data

We will use the getParticipant() function from a smart contract deployed on the blockchain to fetch participant IDs. We then use these IDs to look up additional details in the participantDetails.csv file.

Python Code

import csv

# Assuming getParticipant() returns a participant ID
# and participantDetails.csv has columns: ID, Name, Email, Role
reader = csv.DictReader(fileHandleIN)
writer = csv.DictWriter(fileHandleOUT, fieldnames=['ID', 'Name', 'Email', 'Role', 'BlockchainAddress'])

# Write the header to the output file
writer.writeheader()

# Example function to simulate fetching a participant ID from the blockchain
def getParticipant():
    # Simulating fetching participant data; returns a sample ID
    return "participant123"

# Fetch participant ID from blockchain
participant_id = getParticipant()

# Search for the participant in the CSV and write to the output file
for row in reader:
    if row['ID'] == participant_id:
        # Assume we fetch the blockchain address from another blockchain function
        blockchain_address = web3.eth.accounts[0]  # Just an example address
        row.update({'BlockchainAddress': blockchain_address})
        writer.writerow(row)

# Close file handles
fileHandleIN.close()
fileHandleOUT.close()

3. Resulting Dataset

The resulting dataSet.csv will contain a combination of data fetched from both the Ethereum blockchain and the participantDetails.csv file. Each row in dataSet.csv represents a participant with fields from the CSV and their blockchain address added.
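When more than one participant has to be combined, indexing the CSV once avoids rescanning the file for every ID. The sketch below is self-contained and makes several assumptions: in-memory buffers stand in for participantDetails.csv and dataSet.csv, and the IDs, names, and addresses are invented.

```python
import csv
import io

# In-memory stand-ins for participantDetails.csv and dataSet.csv.
details_csv = io.StringIO(
    "ID,Name,Email,Role\n"
    "p1,Ada,ada@example.com,Developer\n"
    "p2,Ben,ben@example.com,Investor\n"
)
output_csv = io.StringIO()

# Stand-in for IDs and addresses read from the chain (made-up values).
on_chain = {"p1": "0x" + "11" * 20, "p3": "0x" + "33" * 20}

# Index the CSV once so each on-chain ID becomes a dictionary lookup.
details_by_id = {row["ID"]: row for row in csv.DictReader(details_csv)}

writer = csv.DictWriter(
    output_csv, fieldnames=["ID", "Name", "Email", "Role", "BlockchainAddress"]
)
writer.writeheader()

unmatched = []
for participant_id, address in on_chain.items():
    row = details_by_id.get(participant_id)
    if row is None:
        unmatched.append(participant_id)  # on-chain ID with no off-chain record
        continue
    writer.writerow({**row, "BlockchainAddress": address})

lines = output_csv.getvalue().splitlines()
```

Keeping the unmatched list makes gaps between on-chain and off-chain data visible instead of silently dropping them.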

Example Scenario: Enhancing Data Quality for Sales Performance Analysis

Assume that after initially fetching and combining data as described, you encounter several data quality issues:

Missing Data: Some participants’ email addresses or roles are missing.

Malformed Data: Some entries have postal codes in incorrect formats.

Inconsistent Units: Participant ages are recorded in different units (some in years, some in months).

Steps to Clean and Normalize the Data

Step 1: Handling Missing Data

When data is missing, you have several options:

  • Ignore: Skip entries with missing data points.
  • Fill: Impute missing values using a common strategy (e.g., median, mean, mode, or a fixed value like ‘Unknown’).
  • Flag and Fill: Mark data as imputed for transparency and fill it.

import pandas as pd

# Load the dataset (simulating with a DataFrame)
data = pd.DataFrame({
    'ID': ['001', '002', '003', '004'],
    'Email': ['', None, '', ''],
    'Role': ['Developer', 'Investor', 'User', None]
})

# Fill missing emails with a placeholder and empty strings with 'Unknown'
data['Email'] = data['Email'].fillna('no_email_provided').replace('', 'Unknown')

# Fill missing roles with the most common role.
# mode() returns all tied values in sorted order, so [0] picks the first of them.
most_common_role = data['Role'].mode()[0]
data['Role'] = data['Role'].fillna(most_common_role)
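The "Flag and Fill" option can be sketched as follows. To keep the example dependency-free it uses plain dictionaries rather than a DataFrame; the records and the Role_imputed column name are assumptions made for illustration.

```python
from collections import Counter

# Sample records; one role is missing.
records = [
    {"ID": "001", "Role": "Developer"},
    {"ID": "002", "Role": "Investor"},
    {"ID": "003", "Role": "Developer"},
    {"ID": "004", "Role": None},
]


def flag_and_fill(rows, column, fill_value):
    """Return copies of rows with missing values filled and a flag column added."""
    flag = f"{column}_imputed"
    cleaned_rows = []
    for row in rows:
        missing = row.get(column) in (None, "")
        value = fill_value if missing else row[column]
        cleaned_rows.append({**row, column: value, flag: missing})
    return cleaned_rows


# Use the most frequent known role as the fill value.
known_roles = [r["Role"] for r in records if r["Role"]]
most_common_role = Counter(known_roles).most_common(1)[0][0]

cleaned = flag_and_fill(records, "Role", most_common_role)
```

The flag column lets later analysis exclude or down-weight imputed values, which a plain fill makes impossible.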


Step 2: Correcting Malformed Data

For data like postal codes that may be malformed or not standardized, apply specific rules or regular expressions to correct or standardize them.

Python Code Example for Correcting Postal Codes:

# Simulating postal code corrections
data['PostalCode'] = ['12345', 'ABCDE', '9876', '123AB']

# Keep valid US postal codes (exactly 5 digits); mark everything else as 'Invalid'
data['PostalCode'] = data['PostalCode'].apply(lambda x: x if x.isdigit() and len(x) == 5 else 'Invalid')
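The same rule can be written as a regular expression, which also makes it easy to accept the nine-digit ZIP+4 form and to tolerate stray whitespace. This is a sketch of one possible rule set; which formats count as valid is an assumption that depends on the dataset.

```python
import re

# Five digits, optionally followed by a hyphen and four more digits (ZIP+4).
ZIP_PATTERN = re.compile(r"[0-9]{5}(-[0-9]{4})?")


def clean_postal_code(value):
    """Return a normalized US ZIP code, or 'Invalid' if the value cannot be one."""
    if value is None:
        return "Invalid"
    candidate = str(value).strip()  # correct harmless whitespace before validating
    return candidate if ZIP_PATTERN.fullmatch(candidate) else "Invalid"


raw_codes = ["12345", " 12345 ", "12345-6789", "ABCDE", "9876", "123AB"]
cleaned_codes = [clean_postal_code(code) for code in raw_codes]
```

Using fullmatch rather than search ensures the whole value must fit the pattern, so a string such as "123456" is rejected instead of matching on its first five digits.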


Step 3: Normalizing Data

Convert all ages from months to years if they are mistakenly recorded in months, ensuring consistency across the data set.

Python Code Example for Normalizing Ages:

# Assume ages are mixed with some entries in months
data['Age'] = [240, 25, 360, 30]  # Age in months and years

# Normalize ages: if age is greater than 100, assume it's in months and convert to years
data['Age'] = data['Age'].apply(lambda x: x // 12 if x > 100 else x)
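The threshold rule above is a heuristic: an age of 18 months would be read as 18 years. If the source system records the unit alongside the value, the conversion can be explicit instead. The sketch below assumes such a unit label exists, which the original dataset may not provide.

```python
def age_in_years(value, unit):
    """Convert an age to whole years using an explicit unit label."""
    if unit == "years":
        return int(value)
    if unit == "months":
        return int(value) // 12
    raise ValueError(f"unknown age unit: {unit!r}")


# (value, unit) pairs; the unit label is an assumption about the source data.
raw_ages = [(240, "months"), (25, "years"), (360, "months"), (30, "years"), (18, "months")]
ages = [age_in_years(value, unit) for value, unit in raw_ages]
```

Raising an error on an unknown unit surfaces bad records during cleaning rather than letting them distort the analysis later.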



After all three steps, the cleaned dataset looks like this:

    ID              Email       Role  PostalCode  Age
0  001            Unknown  Developer       12345   20
1  002  no_email_provided   Investor     Invalid   25
2  003            Unknown       User     Invalid   30
3  004            Unknown  Developer     Invalid   30

By addressing these typical issues—missing, malformed, and inconsistently scaled data—you enhance the reliability of your dataset, making it more suitable for accurate analysis. The cleaning process, as illustrated, helps ensure that your models are built on a solid foundation of quality data, improving their effectiveness and the validity of your insights.

Long Answer Questions

  1. Evaluate the effectiveness of real-time parsing versus two-step analysis in blockchain data analytics. Discuss scenarios where each method would be optimally used and justify your reasoning based on the data integrity and analysis latency.
    • Answer Hint: Consider real-time parsing for scenarios requiring immediate data usage, like fraud detection or real-time market analysis, where latency is crucial. Two-step analysis might be more suitable for complex analytics that require historical data comparisons or where data can be batch processed.
  2. Analyze how the integration of external databases with blockchain data can enhance the analytical capabilities of blockchain applications. Propose a framework that outlines key considerations for maintaining data integrity and privacy.
    • Answer Hint: Discuss the balance between on-chain and off-chain data storage, the role of APIs in data retrieval, and the importance of encryption and access controls to protect privacy while ensuring data is comprehensive for detailed analysis.
  3. Design a blockchain-based system for tracking supply chain logistics that employs both real-time parsing and two-step analysis. Outline how each method contributes to the overall efficiency and accuracy of the supply chain monitoring.
    • Answer Hint: Real-time parsing could be used for immediate updates and alerts on logistics movements, while two-step analysis could provide deeper insights into supply chain efficiency, predictive maintenance, or optimization strategies based on historical data.
  4. Critically assess the trade-offs involved in repeated analysis versus one-off analysis in blockchain environments. Consider factors such as computational overhead, data freshness, and practical applicability in different industrial applications.
    • Answer Hint: Evaluate the sunk cost of data retrieval, the importance of up-to-date information in volatile markets or environments, and the benefits of building reusable data models that can provide ongoing insights versus single-instance data queries.
  5. Discuss the potential implications of GDPR and the “Right to Be Forgotten” on blockchain applications that integrate personal data. Propose solutions that blockchain developers can implement to comply with these regulations while leveraging the immutable nature of blockchain.
    • Answer Hint: Explore the use of hashing to store personal identifiers off-chain, the challenges associated with updating or deleting information on an immutable ledger, and potential architectural solutions like sidechains or off-chain data stores that allow for compliance without compromising the benefits of blockchain technology.