FSharpGephiStreamer


Introduction to Exploratory data analysis using FSharpGephiStreamer and Gephi

23/1/2019 (applies to all site/download requests done for this document); Kevin Schneider

Introduction

Exploratory data analysis is an essential part of data analysis, especially if you are working with large datasets. It is always helpful to visualize your data to get an idea of its tendencies and structure. In the context of networks, Gephi has proven to be a powerful tool in various CSB projects for this exact purpose.

For the purpose of this tutorial/walkthrough, we will create a node and edge list from real data and stream those to gephi. Afterwards we will explore the resulting network a little.

However, this is not intended to be a guide on how to use gephi in general, although a few words will be said about the things done inside gephi to visualize the network.

Note: It is currently planned to flesh out the analysis of the network to become a full blog post on our blog. A link will be added here when that is done.

The dataset

In computer science and information science, an ontology encompasses a representation, formal naming, and definition of the categories, properties, and relations between the concepts, data, and entities that substantiate one, many, or all domains. Every field creates ontologies to limit complexity and organize information into data and knowledge. (from wikipedia)

Ontologies provide an extensible and queryable knowledge base. In the context of computational biology, they are a valuable tool to characterize all kinds of biological processes and/or entities, and they are often used to see if specific types of these are enriched in an experiment (ontology enrichment).

The dataset of interest for this tutorial is the knowledgebase provided by the Gene Ontology Consortium (also known as GO). It provides concepts/classes used to describe gene function, and relationships between these concepts.

One of the main uses of the GO is to perform enrichment analysis on gene sets. For example, given a set of genes that are up-regulated under certain conditions, an enrichment analysis will find which GO terms are over-represented (or under-represented) using annotations for that gene set. (from GO's website)

The full ontology can be downloaded here.

Exploratory data analysis using FSharpGephiStreamer & Gephi

The data was originally parsed using the Obo parser from our bioinformatics toolbox BioFSharp. The code for this is included in the data acquisition section below. However, to avoid dependencies and ensure reproducibility of this tutorial, the data was also prepared to be usable without any dependency other than FSharpGephiStreamer itself. The node and edge lists can be found as .csv files here. If you want to reproduce this analysis, just parse these files and construct the node and edge types from them. Keep in mind that you lose a lot of the information contained in the obo file that way, as the csv files only contain term names and Is-A relationships.

Preparing nodes and edges

We define GO terms as our nodes and Is-A relations between those terms as our edges. This will result in a network that shows how the knowledge base is structured. There are a few interesting things that can be visualized by this:

  • The most descriptive terms: The nodes with the highest in-degree are the terms which describe the most knowledge in the network. Maybe we can also infer from this what the main fields of (genomic) biological research are.
  • Subgraphs of the network may show that there are different well-described knowledge types that are highly differentiated from each other.
  • Connectivity between hubs: Terms that connect subgraphs or hubs and act as 'knowledge glue'.

However, there is much more information in the obo file than these relationships. Visualizing other relationships is a topic for another day.
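The in-degree idea from the list above can be checked on the edge list before anything is streamed. The following is a minimal sketch, assuming edges are available as plain (source, target) pairs; the term ids and the names `exampleEdges` and `inDegrees` are made up for illustration and are not part of FSharpGephiStreamer.

```fsharp
// Toy Is-A edge list as (source, target) pairs; the ids are hypothetical.
let exampleEdges =
    [ "GO:B", "GO:A"
      "GO:C", "GO:A"
      "GO:D", "GO:B"
      "GO:D", "GO:A" ]

/// Counts how many edges point at each target term, highest count first.
/// Terms that are never a target (in-degree 0) do not appear in the result.
let inDegrees (edges: (string * string) list) =
    edges
    |> List.countBy snd
    |> List.sortByDescending snd

let ranked = inDegrees exampleEdges

ranked
|> List.iter (fun (term, degree) -> printfn "%s has in-degree %d" term degree)
```

On the real data, the terms at the top of such a ranking would be the candidates for the "most descriptive terms" mentioned above.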

open FSharpGephiStreamer
open FSharpGephiStreamer.Colors

/// Simplified GO Term as node
type GONode = { 
    //GO term (e.g. "GO:0000001")
    Id              : string
    //full term description e.g. "RNA polymerase I transcription factor complex"
    TermDescription : string
    //e.g. "biological process"
    NameSpace       : string
    //The color for the node
    Color           : Colors.Color
    }

/// Creates GONode
let createGONode id descr nameSpace col =
    {Id = id; TermDescription = descr; NameSpace = nameSpace ; Color = col}

/// Represents the Is_A relationship of GO terms as a directed edge
type GOEdge = {
    Id          : int
    Source      : string
    Target      : string
    TargetColor : Colors.Color
}

/// Creates GOEdge
let createGOEdge i source target tc = {Id = i; Source = source; Target = target; TargetColor = tc}

Data acquisition

The data was originally parsed from the .obo file using BioFSharp's Obo parser; that code is shown after the csv version below.

Parsing the csv files can be done without dependencies using this code:

open System.IO
open System.Text

let readFromFile (file:string) =
        seq {use textReader = new StreamReader(file, Encoding.Default)
             while not textReader.EndOfStream do
                 yield textReader.ReadLine()}

let nodes = 
    readFromFile (__SOURCE_DIRECTORY__ + "/data/goNodeList.csv") 
    |> List.ofSeq
    //Skip the header line of the csv file
    |> List.skip 1
    |> List.map (fun n -> let tmp = n.Split([|','|])
                          createGONode tmp.[0] tmp.[1] tmp.[2] (Colors.Table.StatisticalGraphics24.getRandomColor())) 

let edges = 
    readFromFile (__SOURCE_DIRECTORY__ + "/data/goEdgeList.csv")
    |> List.ofSeq
    //Skip the header line of the csv file
    |> List.skip 1
    |> List.map (fun n -> let tmp = n.Split([|','|])
                          // List.find scans the node list for every edge; this takes some time
                          // but ensures that edges have the same color as their target nodes.
                          createGOEdge (tmp.[0] |> int) tmp.[1] tmp.[2] (nodes |> List.find(fun n -> n.Id = tmp.[2])).Color)

Parsing the .obo file with BioFSharp's Obo parser looks like this:

open FSharpAux
open FSharpAux.IO
open BioFSharp
open BioFSharp.IO
open BioFSharp.IO.Obo

let readFile path =
    FileIO.readFile path
    |> Obo.parseOboTerms
    |> Seq.toList

let goObo = readFile @"go.obo"



///Node list containing all GO terms
let goNodes =
    goObo
    |> List.map (fun x -> (createGONode x.Id x.Name x.Namespace (Colors.Table.StatisticalGraphics24.getRandomColor())))


///Edge list containing all Is-A relationships in the knowledge base
let goEdges =
    goObo
    //ignore terms that have no Is-A relationship to any term
    |> List.filter (fun x -> not (List.isEmpty x.IsA))
    |> List.map (fun x -> 
                            [for target in x.IsA do 
                                yield ( x.Id, 
                                        target , 
                                        //ensure edges have the color of the node they target;
                                        (goNodes |> List.find(fun node -> node.Id = target) ).Color)
                                        ])
    //Aggregate resulting edges in a single list
    |> List.concat
    |> List.mapi (fun i (sourceId, targetId, col) -> createGOEdge (i+1) sourceId targetId col)
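Both edge-building snippets above look up the target node's color with List.find, which walks the node list once per edge. With tens of thousands of nodes and edges, that is likely where most of the waiting time goes. A possible alternative, sketched here under assumptions, is to build an id-to-color Map once and query it per edge. The types `Rgb`, `ToyNode` and `ToyEdge` are simplified stand-ins so that the sketch runs without any dependency; they are not the tutorial's GONode, GOEdge or Colors.Color types.

```fsharp
// Stand-in types (hypothetical) so the sketch is self-contained.
type Rgb = { R: byte; G: byte; B: byte }
type ToyNode = { Id: string; Color: Rgb }
type ToyEdge = { EdgeId: int; Source: string; Target: string; TargetColor: Rgb }

let toyNodes =
    [ { Id = "GO:A"; Color = { R = 255uy; G = 0uy; B = 0uy } }
      { Id = "GO:B"; Color = { R = 0uy; G = 0uy; B = 255uy } } ]

// Build the id -> color table once instead of scanning the node list per edge.
let colorById =
    toyNodes
    |> List.map (fun n -> n.Id, n.Color)
    |> Map.ofList

// csv-style edge rows in the form id,source,target (made-up data).
let toyRows = [ "1,GO:B,GO:A" ]

let toyEdges =
    toyRows
    |> List.map (fun row ->
        let tmp = row.Split(',')
        { EdgeId = (tmp.[0] |> int)
          Source = tmp.[1]
          Target = tmp.[2]
          // Like List.find, Map.find throws KeyNotFoundException for unknown ids.
          TargetColor = Map.find tmp.[2] colorById })
```

The result is the same as with List.find as long as node ids are unique; if ids were duplicated, Map.ofList would keep the last entry while List.find returns the first match.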

Streaming to gephi

The Grammar module provides a set of rules that convert the streamed data into JSON objects which Gephi understands. To stream the nodes and edges to Gephi, we need a converter function for each. A converter takes the node/edge and returns a list of Grammar attributes that define the mapping from attributes of the data to Gephi-readable attributes.

We then combine these converters with the Streamer.addNode/Streamer.addEdge functions to create our final addNode/addEdge functions.

let addOboNode (node:GONode) =

    let nodeConverter (node:GONode) =
        [
            Grammar.Attribute.Label (sprintf "%s | %s {%s}" node.Id node.TermDescription node.NameSpace); 
            Grammar.Attribute.Size  10.; 
            Grammar.Attribute.Color node.Color; 
            Grammar.Attribute.UserDef ("UserData",node.TermDescription); 
        ]
    Streamer.addNode nodeConverter node.Id node


let addOboEdge (edge:GOEdge) =

    let edgeConverter (edge:GOEdge) =
        [
            Grammar.Attribute.Size  1.; 
            Grammar.Attribute.EdgeType  Grammar.EdgeDirection.Directed;             
            Grammar.Attribute.Color edge.TargetColor ;      
        ]
    
    Streamer.addEdge edgeConverter edge.Id edge.Source edge.Target edge

To stream the nodes and edges to gephi, we use the addNode/addEdge functions on the list of nodes/edges:

goNodes |> List.map addOboNode
goEdges |> List.map addOboEdge

Alternatively, when using the data parsed from the provided csv files:

nodes |> List.map addOboNode
edges |> List.map addOboEdge

That's it. In roughly 40 lines of code we streamed a complete knowledge graph with 47345 nodes and 77187 edges to Gephi. The network is now ready to be explored.

Results

The network

After applying some styles in the preview section (e.g. black background, rounded edges) the final rendered network looks like this:

Network sections

By eye, there are 9 large communities in the network, clustering knowledge about the following processes/entities:

  • Binding
  • Transferases
  • Regulation
  • Protein-Complex
  • Metabolic-Processes
  • Oxireductases
  • Intracellular-Transport
  • Reproduction
  • Transmembrane-Transport

Metrics

Average Degree & Degree distribution

The average degree is 1.63. The degree distribution is highly right-skewed, meaning that many nodes have a low degree while a few hubs have a very high degree. This is a typical property of real-world networks.
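The reported average degree can be reproduced from the node and edge counts given earlier in this tutorial. The sketch below assumes that, for a directed graph, the average degree is taken as the number of edges divided by the number of nodes, which matches the value above. The `inDegreeHistogram` helper and its toy input are made up for illustration.

```fsharp
// Counts reported earlier in this tutorial for the streamed network.
let nodeCount = 47345
let edgeCount = 77187

// Each directed edge adds one to the out-degree of its source (and one to the
// in-degree of its target), so the mean out-degree is edges / nodes.
let averageDegree = float edgeCount / float nodeCount
printfn "average degree: %.2f" averageDegree

/// In-degree histogram from (source, target) pairs:
/// a list of (in-degree, number of nodes with that in-degree), sorted by in-degree.
/// Nodes with in-degree 0 never appear as a target and are therefore not counted.
let inDegreeHistogram (edges: (string * string) list) =
    edges
    |> List.countBy snd   // in-degree per target node
    |> List.countBy snd   // number of nodes sharing each in-degree
    |> List.sortBy fst

let toyHistogram =
    inDegreeHistogram [ "b", "a"; "c", "a"; "d", "a"; "d", "b" ]
```

A right-skewed distribution shows up in such a histogram as large node counts at small degrees and only a handful of nodes at the high end.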

Modularity

When the network modularity is calculated with a low resolution, the resulting large communities correlate well with the previous by-eye observation, although some of these communities split into large sub-communities. The overall modularity of the network with a resolution of 3 is 0.89 (high modularity).
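For orientation, the commonly used definition of modularity for an undirected graph is given below. This is the textbook form only; how Gephi applies its resolution parameter and handles edge direction is not covered by this formula and may differ.

```latex
Q = \frac{1}{2m} \sum_{i,j} \left[ A_{ij} - \frac{k_i k_j}{2m} \right] \delta(c_i, c_j)
```

Here $A_{ij}$ is the adjacency matrix, $k_i$ is the degree of node $i$, $m$ is the number of edges, $c_i$ is the community of node $i$, and $\delta(c_i, c_j)$ is 1 if both nodes are in the same community and 0 otherwise. Values close to 1 indicate that far more edges fall within communities than would be expected by chance, which is consistent with calling 0.89 a high modularity.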

Below is the network with nodes colored by their community membership:

Close up of some communities

Binding

Transferases

Regulation

Protein-Complex

Metabolic-Processes

Oxireductases

Intracellular-Transport

Reproduction

Transmembrane-Transport
