
Thread: First impression and questions on Tuple, reusing space, compression algorithm etc

  1. #1
    Junior Member
    Join Date
    Jan 2016
    Posts
    2

    Default First impression and questions on Tuple, reusing space, compression algorithm etc

    Good day and Happy New Year!

    I have just finished some tests and would like to thank you for your work.
    Easy to download, comfortable to use: no awkward connection strings and none of the other setup overhead that market data processing does not need.

    So let me ask some questions to better understand how to benefit from STSdb as much as possible.
    I have tested both struct and class (my choice is struct):
    public struct TradeTick
    {
        public double Price { get; set; }
        public int Volume { get; set; }
        public DateTime Time { get; set; }

        public string Class { get; set; }
        public string Code { get; set; }
    }
    
    public struct OrderBookOrder
    {
        public double Price { get; set; }
        public int Volume { get; set; }
    }

    public struct OrderBook
    {
        public IReadOnlyList<OrderBookOrder> SellQueue { get; }
        public IReadOnlyList<OrderBookOrder> BuyQueue { get; }
        public DateTime Time { get; set; }

        public string Class { get; set; }
        public string Code { get; set; }

        public OrderBook(
            List<OrderBookOrder> _SellQueue,
            List<OrderBookOrder> _BuyQueue,
            DateTime _time,
            string _class,
            string _code
            )
        {
            this.Time = _time;
            this.Class = _class;
            this.Code = _code;

            this.SellQueue = _SellQueue;
            this.BuyQueue = _BuyQueue;
        }
    }
    
    So in creating, reading and writing, the struct is a little better, as expected for value vs. reference types in general: less memory usage, less work for the GC.

    Questions:
    1. From your point of view what is better for storing: struct or class?

    2. 10,000,000 records for ticks and order books. The ticks file seems to be 2.5-3x bigger than the order book file. Is that OK - what type of compression algorithm are you using? Is each record stored separately? I suppose the compression is not a secret, but I am a little surprised. The write time per tick record is shorter, but the read time (mean per record) seems slightly faster for the bigger struct - the order book. I suppose decompression does not need much computation, is done on the fly, and may even be comparable to a simple in-memory read/write. I suppose the mean order book read time may be explained by the Win IO API.
    2.1. Have you tested your compression algorithm's performance versus Snappy or LZ4? Would the LZ4 family be faster at reading? I am more interested in read performance. Write performance with compression is satisfying.

    3. I have not found you using the super C# type Tuple. If it's not a secret, why not? I have seen KeyValuePair, though. Tuple as a Dictionary key outperforms structs, according to my personal tests, recommendations by J. Richter, and independent tests.
    I have seen your type definitions and would certainly like to add this type and test it.
    So what do you think about it? How many dependencies do you really have? At first glance I would only have to add the Tuple definitions in the 2 files defining DataTypes, and maybe somewhere where the persist logic works. I am not sure about the overall logic yet. Do you have your own plans to include it in the code?

    4. Where can I read and understand more about leaves and nodes? To be honest, I don't understand from the quick guides what they are. I only understand that we have data stored sequentially, or in some checkerboard order, in the disk file.
    For example, I need a cache for each series of ticks/order books covering the last 5 minutes - say a 500-2,000 element circular buffer for ticks and 5,000-10,000 elements for order books - plus another cache for modeling data, which may be static (from the database) and streaming (possibly sharing some of the above cache). What kind of setup would you recommend with the storage engine? What are a node and a leaf here? I supposed a leaf is a data type: ticks for table1, order books for table2.

    5. If I need to get some data, do I just call the table like a dictionary by key and everything is done behind the scenes? Do I need to think about indexing or create my own?

    6. What type of IPC is used in Server/Client? Protobuf, WCF with custom serialization/deserialization? I need to work with the following scenario: a data server running as a service or application, responsible only for recording the real-time data stream. Another application, responsible for modeling, charting etc., should run separately and subscribe to the database server to get static data and maybe the stream too. I am trying to understand which way will be faster: subscribing to new-data events from the server application (and I don't like WCF), or subscribing directly to the stream as a second channel. But I am not sure that would be better either.
    Certainly I will test, but I need to understand first.

    7. Addition to p.6 - Do you have special notifications for clients when data updates are done? I have not seen them in your introduction, but I may have missed something. If you are saying your persist methods are better: from your point of view, what is better to use - WCF with a binary protocol (passing a byte[] array, to be exact), or the same through Protobuf to send/receive the byte array and deserialize it with your persist methods? Or may we try to use the client/server or heap protocols? I have not found this answer in the accompanying docs and threads yet.

    8. Is free space (after erasing a whole table or table elements) reused by STSdb? How does file fragmentation affect STSdb performance?

    9. Do you have something like a "vacuum" command to free up unused space? Or do I have to, for example, recreate a new file in which the free space will be "forgotten"?

    10. Do you have more materials in addition to the quick start and the developer guide? I need more information on creating indexes and multi-key dictionaries. I have seen some information in forum threads, but I don't feel comfortable enough with it yet.
    It seems I've found the answers.

    11. What does the Commit() method do exactly? Is it an async method? Does it block the executing thread? Or does it just fire a command (or event) while all the other mechanics work in the background (via the thread pool)? What do I mean? Do I need to organize my own queue to hold real-time data arriving during Commit() execution, and unload this queue after Commit() returns control? Or is such a queue realized inside the table classes (server or STSdb client), so that all I have to do is keep adding new elements to the table and everything is synchronized behind the scenes?

    12. What key would you recommend for holding time series? In the "Getting started" introduction there are some suggestions:
    public class Key
    {
        public string Symbol { get; set; }
        public DateTime Timestamp { get; set; }
    }
     
    public class Provider
    {
        public string Name { get; set; }
        public string Website { get; set; }
    }
     
    public class Tick
    {
        public double Bid { get; set; }
        public double Ask { get; set; }
        public int BidSize { get; set; }
        public int AskSize { get; set; }
        public Provider Provider { get; set; }
    }
    
    But what is happening behind it? How does it affect lookup performance?
    What is better: storing all symbols of one data type in one table, or creating a new table for every symbol?
    The latter looks logically better. I suppose creating a new file for every symbol and data type would be much faster, but who knows.

    13. Suppose I need to load some time series with a time range: from date 1 to date 2, from 13h25m30s to 13h25m38s.
    Do I need to create a custom comparer? Custom lookup methods?
    Could you provide 2 examples:
    a.) search and get a data range
    b.) search and point a cursor at the beginning of a data range
    I hope you have methods without any LINQ.

    It seems I've found the answers.

    14. In reality I need 2 kinds of requests:
    - to load a data range and keep it in a temporary memory cache;
    - to iterate through a data range, loading only one tick or order book at a time, as needed for one-pass recursive metrics.
    What kind of cache tuning is better for each scenario? I suppose the best answer is according to the Windows cluster size, and more if you do not tune the Windows IO API. Is that the right line of thinking?

    15. I found in one of your threads that the "engine" sorts data by keys. Is this done automatically, or only in the table returned by a request?
    Do you have a special command to physically sort the tables stored on disk?

    16. Concerning compression: from the guide I understood that compression fires exactly after the Commit() command. Is it possible to tune the compression moment? What I mean is: should data be compressed at the moment of storing and decompressed by calling the table[key] method (better said, this call should return a new object of the DataType)? The simple, well-known logic behind this: if we store data compressed and get a DataType object on access, we gain benefits. Or do you suggest using the memory cache mode - the above scenario looks convenient there. But in my tests, with my data types, the disk read/write ratio is 7:1-8:1 (I suppose the time is mostly consumed by compression). So if I need to save table elements to disk, I have to call the in-memory base (where they are decompressed) and pass a DataType object to the file storage engine, which will try to compress it again. Is it possible to save work when passing this already-compressed data?

    Sorry, friends, for so many questions right from the start. But I hope it will help me understand what STSdb is and how to use it at 200%.
    It seems to me that in developing this engine you were not looking for a general solution, but were inspired by tasks highly correlated with my questions.
    And it is well known - no information, no questions. Interesting results lead to more questions.

    Thank you very much in advance.
    Last edited by Alexey_KK; 06.01.2016 at 07:27.

  2. #2

    Default

    Hi,

    Thank you for evaluating STSdb 4.0.

    About your questions.

    First, according to the STSdb 4.0 Developer's Guide, the database supports the following types:

    1. For keys:
      1. all linear types
      2. Enums
      3. Guid

    2. For records:
      1. all anonymous types
      2. Enums
      3. Guid
      4. Classes (with public default constructor) and structures, containing public read/write properties or fields with types from the above 3 groups
      5. T[], List<T>, Dictionary<K, V>, KeyValuePair<K, V> and Nullable<T>, where T, K and V are types from the above 4 groups
      6. Classes (with public default constructor) and structures, containing public read/write properties or fields with types from the above 5 groups

    The definition is recursive and the engine can handle very complex types. For example, it can support Dictionary<DateTime, List<KeyValuePair<long, string[]>>> as a record type, as in the sketch below.
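
    For illustration, here is a minimal sketch of storing such a record type (using the same STSdb.FromFile/OpenXTable API as the examples below; the file and table names are arbitrary, and the using directives are omitted as in the other examples in this thread):

        using (IStorageEngine engine = STSdb.FromFile("complex.stsdb4"))
        {
            //the record type is a nested generic - still covered by the recursive type definition above
            var table = engine.OpenXTable<long, Dictionary<DateTime, List<KeyValuePair<long, string[]>>>>("complex");

            var record = new Dictionary<DateTime, List<KeyValuePair<long, string[]>>>();
            record.Add(DateTime.Now, new List<KeyValuePair<long, string[]>>
            {
                new KeyValuePair<long, string[]>(1, new[] { "a", "b" })
            });

            table[1] = record;      //serialized by the logic generated for this record type
            engine.Commit();
        }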

    A .NET type in STSdb 4.0 is linear, if it is one of the following types:
    1. Primitive type - Boolean, Char, SByte, Byte, Int16, UInt16, Int32, UInt32, Int64, UInt64, Single, Double, Decimal, DateTime, TimeSpan, String, byte[];
    2. Classes (with public default constructor) and structures, containing public read/write properties or fields with primitive types;

    For example, in our terms DateTime, Int64 and String types are linear types. The following Tick type is also linear:
    public class Tick
    {
        public string Symbol { get; set; }
        public DateTime Timestamp { get; set; }
        public double Price { get; set; }
    }
    
    because it is built only from primitive types.

    Let's get back to your types.

    Your example contains types that are not supported by the engine:
    public struct OrderBookOrder
        {
            public double Price { get; set; }
            public int Volume { get; set; }
        }
       
        public struct OrderBook
        {   
            public IReadOnlyList<OrderBookOrder> SellQueue { get; }
            public IReadOnlyList<OrderBookOrder> BuyQueue { get; }
    
            public DateTime Time { get; set; }
            public string Class { get; set; }
            public string Code { get; set; }
            ...
        }
    
    The engine will support (and can fully store) the OrderBookOrder itself, but not the IReadOnlyList<OrderBookOrder> properties.

    When the engine reaches unsupported types, it ignores them. Thus, in your case STSdb 4.0.x will perform only a partial serialization of your OrderBook objects:

        using (IStorageEngine engine = STSdb.FromFile("test.stsdb4"))
        {
            ITable<long, OrderBook> table = engine.OpenXTable<long, OrderBook>("table");
    
            for (int i = 0; i < 10; i++)
            {
                var buyQueue = Enumerable.Range(1, 100).Select(x => new OrderBookOrder() { Price = x, Volume = x }).ToList();
                var sellQueue = Enumerable.Range(1, 100).Select(x => new OrderBookOrder() { Price = x, Volume = x }).ToList();
    
                OrderBook orderBook = new OrderBook(sellQueue, buyQueue, DateTime.Now, i.ToString(), i.ToString());
    
                //all OrderBook objects will be partially stored - only the Time, Class and Code properties will be serialized
                table[i] = orderBook;
            }
    
            engine.Commit();
        }
    
        using (IStorageEngine engine = STSdb.FromFile("test.stsdb4"))
        {
            ITable<long, OrderBook> table = engine.OpenXTable<long, OrderBook>("table");
    
            foreach (var kv in table.Forward())
            {
                long key = kv.Key;
                OrderBook orderBook = kv.Value;
    
                var buyQueue = orderBook.BuyQueue;
                var sellQueue = orderBook.SellQueue;
    
                //buyQueue & sellQueue will be NULL, because they were ignored in the serialization process.
    
                Debug.Assert(orderBook.BuyQueue == null);
                Debug.Assert(orderBook.SellQueue == null);
            }
        }
    
    Even if we change the BuyQueue & SellQueue types to List<OrderBookOrder>, the engine will still ignore them:
        public struct OrderBook
        {   
            public List<OrderBookOrder> SellQueue { get; } //both will be ignored, because there is only a getter
            public List<OrderBookOrder> BuyQueue { get; }
    
            public DateTime Time { get; set; }
            public string Class { get; set; }
            public string Code { get; set; }
            ...
        }
    
    The reason is that the engine can store only read/write properties and fields. So if we add setters, the engine can finally store/restore the whole objects:
        public struct OrderBook
        {   
            public List<OrderBookOrder> SellQueue { get; set; } //it's ok now
            public List<OrderBookOrder> BuyQueue { get; set; }
    
            public DateTime Time { get; set; }
            public string Class { get; set; }
            public string Code { get; set; }
            ...
        }
    
    Another very important thing about the serialization process is that STSdb 4.0.x does not store relations between objects - it stores only the objects' content.

    For example if we have 1 List<OrderBookOrder> instance:
    List<OrderBookOrder> queue = Enumerable.Range(1, 100).Select(x => new OrderBookOrder() { Price = 1.0, Volume = 1 }).ToList();
    
    and we store 2 different OrderBook instances that refer to this queue instance:
        OrderBook orderBook1 = new OrderBook(queue, null, DateTime.Now, "orderBook1", "");
        OrderBook orderBook2 = new OrderBook(queue, null, DateTime.Now, "orderBook2", "");
    
        table[1] = orderBook1;
        table[2] = orderBook2;
    
    When we reopen the database and read the records from the table, they will refer to 2 completely independent sell queue instances with identical content. In the serialization process the queue is one instance, but it is stored twice: first as the orderBook1 member and second as the orderBook2 member.

    When the engine serializes the keys and records, it generates .NET expressions for the fastest serialization possible. During the store process the engine does not store the object relations; it only serializes their content through the prepared .NET expressions. If the developer needs to keep relations between stored objects, he has to do it manually (perhaps through some id logic), for example as sketched below.
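
    A minimal sketch of such id logic (the OrderBookRef type and the table names are hypothetical, introduced only for illustration): store the shared queue once in its own table and keep only its id in each order book.

        //hypothetical record type that keeps an id instead of the queue itself
        public struct OrderBookRef
        {
            public long QueueId { get; set; }   //id of the shared queue, stored in a separate table
            public DateTime Time { get; set; }
            public string Class { get; set; }
            public string Code { get; set; }
        }

        using (IStorageEngine engine = STSdb.FromFile("test.stsdb4"))
        {
            var queues = engine.OpenXTable<long, List<OrderBookOrder>>("queues");
            var books = engine.OpenXTable<long, OrderBookRef>("books");

            //store the shared queue (the 'queue' instance from above) once...
            long queueId = 1;
            queues[queueId] = queue;

            //...and let both order books refer to it by id
            books[1] = new OrderBookRef { QueueId = queueId, Time = DateTime.Now, Class = "orderBook1", Code = "" };
            books[2] = new OrderBookRef { QueueId = queueId, Time = DateTime.Now, Class = "orderBook2", Code = "" };

            engine.Commit();
        }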

    Now to your questions.

    1. From your point of view what is better for storing: struct or class?
    From the database's point of view there is no difference in performance between storing structs and classes.

    For example, for the class Tick1 and for the struct Tick2 the engine will generate almost identical lambda expressions:

    Class Tick1:
    //class
    public class Tick1
    {
        public string Symbol { get; set; }
        public DateTime Timestamp { get; set; }
        public double Price { get; set; }
    }
    
    //store generated code
    .Lambda #Lambda1<System.Action`2[System.IO.BinaryWriter,STSdb4.GettingStarted.Tick1]>(
        System.IO.BinaryWriter $var1,
        STSdb4.GettingStarted.Tick1 $var2) {
        .Block() {
            .Call $var1.Write($var2.Symbol);
            .Call $var1.Write(($var2.Timestamp).Ticks);
            .Call $var1.Write($var2.Price)
        }
    }
    
    //load generated code
    .Lambda #Lambda1<System.Func`2[System.IO.BinaryReader,STSdb4.GettingStarted.Tick1]>(System.IO.BinaryReader $reader) {
        .Block(STSdb4.GettingStarted.Tick1 $var1) {
            $var1 = .New STSdb4.GettingStarted.Tick1();
            $var1.Symbol = .Call $reader.ReadString();
            $var1.Timestamp = .New System.DateTime(.Call $reader.ReadInt64());
            $var1.Price = .Call $reader.ReadDouble();
            .Label
                $var1
            .LabelTarget #Label1:
        }
    }
    
    Struct Tick2:
    //struct
    public struct Tick2
    {
        public string Symbol { get; set; }
        public DateTime Timestamp { get; set; }
        public double Price { get; set; }
    }
    
    //store generated code
    .Lambda #Lambda1<System.Action`2[System.IO.BinaryWriter,STSdb4.GettingStarted.Tick2]>(
        System.IO.BinaryWriter $var1,
        STSdb4.GettingStarted.Tick2 $var2) {
        .Block() {
            .Call $var1.Write($var2.Symbol);
            .Call $var1.Write(($var2.Timestamp).Ticks);
            .Call $var1.Write($var2.Price)
        }
    }
    
    //load generated code
    .Lambda #Lambda1<System.Func`2[System.IO.BinaryReader,STSdb4.GettingStarted.Tick2]>(System.IO.BinaryReader $reader) {
        .Block(STSdb4.GettingStarted.Tick2 $var1) {
            $var1 = .New STSdb4.GettingStarted.Tick2();
            $var1.Symbol = .Call $reader.ReadString();
            $var1.Timestamp = .New System.DateTime(.Call $reader.ReadInt64());
            $var1.Price = .Call $reader.ReadDouble();
            .Label
                $var1
            .LabelTarget #Label1:
        }
    }
    
    As we can see, the generated code is practically identical. The only difference in performance may come from the time .NET needs to create struct vs. class instances.

    You can read more about the integrated persist logic in STSdb in the STS Labs article Persist<T> - fast objects serialization.

    2. 10,000,000 records for ticks and order books. The ticks file seems to be 2.5-3x bigger than the order book file. Is that OK - what type of compression algorithm are you using? Is each record stored separately? I suppose the compression is not a secret, but I am a little surprised. The write time per tick record is shorter, but the read time (mean per record) seems slightly faster for the bigger struct - the order book. I suppose decompression does not need much computation, is done on the fly, and may even be comparable to a simple in-memory read/write. I suppose the mean order book read time may be explained by the Win IO API.
    2.1. Have you tested your compression algorithm's performance versus Snappy or LZ4? Would the LZ4 family be faster at reading? I am more interested in read performance. Write performance with compression is satisfying.
    STSdb 4.0 has two main strategies for the stored objects - with compression and with raw serialization.

    For each table, two separate serialization logics are assigned - one for the table keys and one for the table records:
    var table1 = engine.OpenXTable<TKey, TRecord>("table1");
    
    In the above example:
    • if TKey is a linear type, all keys in the table will be serialized via compression generated for the TKey type;
    • if TKey is not a linear type, all keys in the table will be serialized raw (without compression);
    • if TRecord is a linear type, all records in the table will be serialized via compression generated for the TRecord type;
    • if TRecord is not a linear type, all records in the table will be serialized raw (without compression).

    In all cases, all of the serialization logic is generated at run time through lambda expressions. Each expression is generated and compiled once per database session.

    The generated run-time code for raw serialization of an object is very similar to the code the user would write manually (enumerate sub-objects recursively, write their properties etc.).
    The generated run-time code for object compression is more complex. STSdb 4.0.x uses expressions to generate compression logic for each separate TRecord type. Each expression represents a parallel vertical compression: the compression stores objects of one type grouped by their members.

    For example, if we have a table with records of type Tick1:
    public class Tick1
    {
        public string Symbol { get; set; }
        public DateTime Timestamp { get; set; }
        public double Price { get; set; }
    }
    
    the engine will store the records by compressing their properties separately: all Symbol values in one compressed package, all Timestamp values in a second package, and all Price values in a third package.

    For the string values the engine will choose one sub-compression logic, for the DateTime values a second, and for the double values a third. For example, for double/integer etc. values the engine will by default use a very efficient and fast delta compression; for string values it will use a custom dictionary compression, etc.

    Within the vertical compression logic for a type, all chosen sub-compression logics are executed in parallel.

    So the whole generated compression is quite complex, but it is very fast and achieves a good compression ratio.

    By the way, you can try the compression completely independent from the database. All you have to do is to create a generic IndexerPersist instance for the needed type:
            IndexerPersist<Tick> indexerPersist = new IndexerPersist<Tick>();
    
    Here is an example:
        //generate some ticks
        List<Tick> ticks = TicksGenerator.GetFlow(10000, KeysType.Random).Select(kv => kv.Value).ToList();
    
        //compression
        using (MemoryStream ms = new MemoryStream())
        {
            //create compression logic
            IndexerPersist<Tick> indexerPersist = new IndexerPersist<Tick>();
    
            //compress
            indexerPersist.Store(new BinaryWriter(ms), (idx) => ticks[idx], ticks.Count);
    
            //decompress
            List<Tick> tmp = new List<Tick>();
            ms.Seek(0, SeekOrigin.Begin);
            indexerPersist.Load(new BinaryReader(ms), (idx, tick) => tmp.Add(tick), ticks.Count);
        }
    
        //raw serialization
        using (MemoryStream ms = new MemoryStream())
        {
            //create raw serialization logic
            Persist<Tick> persist = new Persist<Tick>();
    
            //write tick by tick
            BinaryWriter writer = new BinaryWriter(ms);
    
            for (int i = 0; i < ticks.Count; i++)
            {
                persist.Write(writer, ticks[i]);
            }
    
            //read
            BinaryReader reader = new BinaryReader(ms);
            List<Tick> tmp = new List<Tick>();
            ms.Seek(0, SeekOrigin.Begin);
    
            for (int i = 0; i < ticks.Count; i++)
            {
                Tick tick = persist.Read(reader);
    
                tmp.Add(tick);
            }
        }
    
    You can use IndexerPersist with any linear type T. (In the example, the Tick type and the TicksGenerator class are in the GettingStarted project.)

    As for the table write and read speeds in your question, they depend mostly on the current internal WaterfallTree state of the database, not so much on the compression. It depends on whether there are records in the internal nodes waiting to be poured down, or whether most of the records in the W-Tree are already in the leaves...

    We have tested our compression (separately from STSdb) vs. Snappy, LZ4 and the Deflate compression in .NET. I don't remember the exact results, but we were much better both in time and in compression ratio.

    Of course, the default persist logic for a table can be replaced. See the Developer's Guide for details.

    3. I have not found you using the super C# type Tuple. If it's not a secret, why not? I have seen KeyValuePair, though. Tuple as a Dictionary key outperforms structs, according to my personal tests, recommendations by J. Richter, and independent tests.
    I have seen your type definitions and would certainly like to add this type and test it.
    So what do you think about it? How many dependencies do you really have? At first glance I would only have to add the Tuple definitions in the 2 files defining DataTypes, and maybe somewhere where the persist logic works. I am not sure about the overall logic yet. Do you have your own plans to include it in the code?
    This is a good question. Yes, the Tuple classes in .NET are not supported by STSdb 4.0.x. The reason is simple: the tuple classes have properties with only getters (why, Microsoft, why?). That's why Tuples are not supported by the engine. Of course, we can always write additional persist logic to support them - as we do for the KeyValuePair type. But at least for now, Tuples are not supported.

    Instead of Tuples, STSdb 4.0.x offers classes with very similar logic - Slots<> - but with read/write members:
        public interface ISlots
        {
        }
    
        [Serializable]
        public class Slots<TSlot0> : ISlots
        {
            public TSlot0 Slot0;
    
            public Slots()
            {
            }
    
            public Slots(TSlot0 slot0)
            {
                Slot0 = slot0;
            }
        }
    
        [Serializable]
        public class Slots<TSlot0, TSlot1> : ISlots
        {
            public TSlot0 Slot0;
            public TSlot1 Slot1;
    
            public Slots()
            {
            }
    
            public Slots(TSlot0 slot0, TSlot1 slot1)
            {
                Slot0 = slot0;
                Slot1 = slot1;
            }
        }
    
        [Serializable]
        public class Slots<TSlot0, TSlot1, TSlot2> : ISlots
        {
            public TSlot0 Slot0;
            public TSlot1 Slot1;
            public TSlot2 Slot2;
    
            public Slots()
            {
            }
    
            public Slots(TSlot0 slot0, TSlot1 slot1, TSlot2 slot2)
            {
                Slot0 = slot0;
                Slot1 = slot1;
                Slot2 = slot2;
            }
        }
    
    And so on, up to 16 slots...

    These are classes with generic field members. You can safely use them as key or record types in STSdb tables. They are located in the STSdb4.Data namespace. And actually, when the user uses portable tables, the database engine uses these Slots types under the hood to transform user objects into slot objects.
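
    For example (a short sketch; the file and table names are arbitrary), Slots<string, DateTime> can play the role of a (Symbol, Timestamp) tuple key for the Tick1 type from above:

        using (IStorageEngine engine = STSdb.FromFile("test.stsdb4"))
        {
            //Slots<string, DateTime> is a linear type, so the key gets generated comparer and compression logic
            ITable<Slots<string, DateTime>, Tick1> table = engine.OpenXTable<Slots<string, DateTime>, Tick1>("ticks");

            var key = new Slots<string, DateTime>("EURUSD", DateTime.Now);
            table[key] = new Tick1 { Symbol = "EURUSD", Timestamp = key.Slot1, Price = 1.0876 };

            engine.Commit();
        }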

    4. Where can I read and understand more about leaves and nodes? To be honest, I don't understand from the quick guides what they are. I only understand that we have data stored sequentially, or in some checkerboard order, in the disk file.
    For example, I need a cache for each series of ticks/order books covering the last 5 minutes - say a 500-2,000 element circular buffer for ticks and 5,000-10,000 elements for order books - plus another cache for modeling data, which may be static (from the database) and streaming (possibly sharing some of the above cache). What kind of setup would you recommend with the storage engine? What are a node and a leaf here? I supposed a leaf is a data type: ticks for table1, order books for table2.
    You can read more about the tree behind STSdb 4.0.x in the About WaterfallTree article.

    In STSdb 4.0.x each database instance (StorageEngine) is actually a WaterfallTree instance. As a tree, each WaterfallTree has internal nodes and leaf nodes. The nodes, whether internal or leaf, contain user data - records and keys.
    All tables in one database instance share one WaterfallTree instance.

    The cache in STSdb 4.0.x is at node level. This cache is common for all tables in one database. While the database is in use, the cache contains the most recently used W-Tree nodes - or, from the user's point of view, the most recently accessed user data, grouped by nodes. Thus, you have no control over how this cache is distributed between tables. You can only control the maximum number of nodes that the engine can hold in memory.

    If you mostly access the last rows of a time series (a table), the cache will in practice also keep the nodes that contain these records. If you want explicit caching of the last records of a table, I suppose you have to build a lightweight layer between your processing tools and the database, along the lines of the sketch below.
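
    A minimal sketch of such a layer (a hypothetical helper class, not part of STSdb; the capacity is just an example): every record is written to the table, while the last N records also stay in an in-memory queue for fast access.

        //hypothetical helper - keeps the last N records in memory while persisting everything to the table
        public class RecentCache<TKey, TRecord>
        {
            private readonly ITable<TKey, TRecord> table;
            private readonly Queue<KeyValuePair<TKey, TRecord>> recent = new Queue<KeyValuePair<TKey, TRecord>>();
            private readonly int capacity;

            public RecentCache(ITable<TKey, TRecord> table, int capacity)
            {
                this.table = table;
                this.capacity = capacity;
            }

            public void Add(TKey key, TRecord record)
            {
                table[key] = record;   //goes to the database (permanent after engine.Commit())
                recent.Enqueue(new KeyValuePair<TKey, TRecord>(key, record));

                while (recent.Count > capacity)   //behaves like a circular buffer
                    recent.Dequeue();
            }

            public IEnumerable<KeyValuePair<TKey, TRecord>> Recent
            {
                get { return recent; }
            }
        }

        //usage: var tickCache = new RecentCache<DateTime, TradeTick>(ticksTable, 2000);
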
    5. If I need to get some data, do I just call the table like a dictionary by key and everything is done behind the scenes? Do I need to think about indexing or create my own?
    Yes. Each table looks like an ordered dictionary. You can look up by key - get or set data (by key), enumerate all records (in ascending or descending order), or get the records in a range (from key to key).

    But don't forget that all changes in all tables become permanent only when you invoke engine.Commit(). With Commit(), all changes in all tables (since the last database commit) pass or do not pass together. If you change table(s) and don't invoke Commit() after that, these changes will be lost when you close the database.

    Each STSdb4 table has only one index - by key. This is the primary index. There are no secondary indexes; there are no indexes by members of the record type. If you need secondary index logic, you have to organize it manually - for example, as in the sketch below.
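
    A minimal sketch of such manual secondary indexing (the table names are hypothetical; engine is assumed to be an open IStorageEngine): keep a second table that maps the secondary key back to the primary key, and maintain it on every insert.

        //primary table: key = trade time
        ITable<DateTime, TradeTick> trades = engine.OpenXTable<DateTime, TradeTick>("trades");

        //secondary "index": maps (Code, Time) back to the primary key
        ITable<Slots<string, DateTime>, DateTime> byCode = engine.OpenXTable<Slots<string, DateTime>, DateTime>("trades.byCode");

        DateTime time = DateTime.Now;
        TradeTick tick = new TradeTick { Price = 100.5, Volume = 10, Time = time, Class = "TQBR", Code = "SBER" };

        trades[time] = tick;
        byCode[new Slots<string, DateTime>(tick.Code, time)] = time;   //maintained manually on every insert

        engine.Commit();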

    6. What type of IPC is used in Server/Client? Protobuf, WCF with custom serialization/deserialization? I need to work with the following scenario: a data server running as a service or application, responsible only for recording the real-time data stream. Another application, responsible for modeling, charting etc., should run separately and subscribe to the database server to get static data and maybe the stream too. I am trying to understand which way will be faster: subscribing to new-data events from the server application (and I don't like WCF), or subscribing directly to the stream as a second channel. But I am not sure that would be better either.
    Certainly I will test, but I need to understand first.
    The STSdb 4.0 client/server API uses TCP/IP with exactly the same generated lambda expressions to transfer keys and records between client and server. During the transfer the data is never compressed, but it may be compressed on the server side (by the same rules).

    The only significant difference between client/server mode and embedded (direct) mode is that in client/server mode the user data is stored on the server as anonymous data. When the client has to read it, the data is received from the server as anonymous data and is transformed on the client into the TKey and TRecord types via automatically generated transformers. This is done so that every remote client can access the specified table, regardless of the fact that the TKey & TRecord of one client may differ from the TKey & TRecord of another client.

    In embedded (direct) mode there are no such transformers. Both the user and the database engine work directly with the user types (down to the lowest database level).

    About your subquestion: I'm not sure which will be better - "subscribe to new data events from the server application ... or subscribe directly to the stream as a 2nd channel". It depends on how you will use the data - whether you need historical access or only access to the latest data (for realtime charts, for example)... A third variant might be to unite the service app and the chart/data processing app and use the database as embedded...

    If your data streams are heavy, you can also consider some database partitioning - logically grouping the data by weeks or months - to prevent having one really big file (a sketch follows below).
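
    A minimal sketch of such partitioning (the naming scheme is hypothetical): pick the database file from the timestamp, e.g. one file per month. In a real service you would keep the engine of the current partition open instead of reopening it per record.

        //one database file per month, chosen from the tick timestamp
        string fileName = string.Format("ticks-{0:yyyy-MM}.stsdb4", tick.Time);

        using (IStorageEngine engine = STSdb.FromFile(fileName))
        {
            ITable<DateTime, TradeTick> table = engine.OpenXTable<DateTime, TradeTick>("ticks");
            table[tick.Time] = tick;
            engine.Commit();
        }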

    7. Addition to p.6 - Do you have special notifications for clients when data updates are done? I have not seen them in your introduction, but I may have missed something. If you are saying your persist methods are better: from your point of view, what is better to use - WCF with a binary protocol (passing a byte[] array, to be exact), or the same through Protobuf to send/receive the byte array and deserialize it with your persist methods? Or may we try to use the client/server or heap protocols? I have not found this answer in the accompanying docs and threads yet.
    No, we don't have special notifications for a client when another client updates a record. Two clients may access a single table in client/server mode simultaneously. Their changes will be visible to each other, but there are no notifications.

    8. Is free space (after erasing a whole table or table elements) reused by STSdb? How does file fragmentation affect STSdb performance?
    Yes, STSdb reuses the space. The space allocation philosophy of the current implementation is balanced between speed and density.

    The Heap under the W-Tree always stores data as nodes (internal or leaf). Because the nodes in the W-Tree are relatively large - up to several MB - fragmentation (where it occurs) does not affect the performance. By default, when a node has to be stored, the heap engine always tries to allocate a block of space just after the last allocated block. However, the default Heap behavior can be changed:
        //default
        IHeap heap = new Heap("test.stsdb4", false, AllocationStrategy.FromTheCurrentBlock);
    
        //more compact, but may affect performance by 4-5%
        IHeap heap = new Heap("test.stsdb4", false, AllocationStrategy.FromTheBeginning);
    
        using (IStorageEngine engine = STSdb.FromHeap(heap))
        {
        }
    
    When a table (or records of a table) is deleted, the relevant backend W-Tree nodes are gradually physically removed. This does not happen immediately, but over time - while the engine changes the W-Tree (and these nodes) during other database or table operations.

    When a tree node becomes completely unreferenced, its space is recycled by the heap and can be reused. In Commit(), when a new version of a node is safely stored, its old version is also recycled.

    After every Commit() the backend heap file is also truncated (as much as possible).

    9. Do you have something like a "vacuum" command to free up unused space? Or do I have to, for example, recreate a new file in which the free space will be "forgotten"?
    No, there is no such command. As noted above, the unused space between blocks is gradually reused as the database is used.

    10. Do you have more materials in addition to the quick start and the developer guide? I need more information on creating indexes and multi-key dictionaries. I have seen some information in forum threads, but I don't feel comfortable enough with it yet.
    You can read the Developer's Guide. The .pdf file is inside every STSdb 4 release package.

    11. What does the Commit() method do exactly? Is it an async method? Does it block the executing thread? Or does it just fire a command (or event) while all the other mechanics work in the background (via the thread pool)? What do I mean? Do I need to organize my own queue to hold real-time data arriving during Commit() execution, and unload this queue after Commit() returns control? Or is such a queue realized inside the table classes (server or STSdb client), so that all I have to do is keep adding new elements to the table and everything is synchronized behind the scenes?
    The Commit() method stores all changes in all tables in the current database. Atomically. All or nothing. The Commit() is always at database level (there is no Commit per table).

    The Commit() method actually safely stores all WaterfallTree nodes changed (by the tables). The store process for each node is realized with a copy-on-write technique.

    Commit() itself is a synchronous method - it blocks the thread that invokes it until everything is done. In principle, the Commit method does not block other threads (if any) that work with the tables. But it is a race condition whether the table changes will pass before or after Commit(). Commit() always strives to hold each of the W-Tree nodes with minimal locks.

    So if you invoke Commit() from one thread, you can insert data from other threads simultaneously. It is possible to change tables from other threads during the commit process, but it is not certain which part of the changes will pass before or after Commit().
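
    Regarding the queue part of your question: you don't necessarily need your own queue just because of Commit(). A minimal sketch (one writer thread plus one committer thread; engine is an open IStorageEngine, incomingTicks stands for your real-time feed, and the interval is arbitrary):

        ITable<DateTime, TradeTick> table = engine.OpenXTable<DateTime, TradeTick>("ticks");

        //writer thread: keeps inserting real-time data
        Task writer = Task.Run(() =>
        {
            foreach (TradeTick tick in incomingTicks)   //incomingTicks: your real-time feed (assumed)
                table[tick.Time] = tick;
        });

        //committer thread: periodically makes the inserted data permanent
        Task committer = Task.Run(() =>
        {
            while (!writer.IsCompleted)
            {
                Thread.Sleep(TimeSpan.FromSeconds(5));
                engine.Commit();   //blocks only this thread; inserts continue in parallel
            }
            engine.Commit();       //final commit after the feed ends
        });

        Task.WaitAll(writer, committer);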

    12. What key would you recommend for holding time series? In the "Getting started" introduction there are some suggestions:
    public class Key
    {
        public string Symbol { get; set; }
        public DateTime Timestamp { get; set; }
    }
     
    public class Provider
    {
        public string Name { get; set; }
        public string Website { get; set; }
    }
     
    public class Tick
    {
        public double Bid { get; set; }
        public double Ask { get; set; }
        public int BidSize { get; set; }
        public int AskSize { get; set; }
        public Provider Provider { get; set; }
    }
    
    But what is happening behind it? How does it affect lookup performance?
    What is better: storing all symbols of one data type in one table, or creating a new table for every symbol?
    The latter looks logically better. I suppose creating a new file for every symbol and data type would be much faster, but who knows.
    When a user opens a table with a primitive or composite key, the engine generates and compiles (with lambda expressions) comparison code specifically for that key type. The generated code is aimed at the maximum possible performance.
    You can read more about the generated comparers in the STS Labs section:
    Comparer<T> - fast key comparer
    The integrated comparers can also be used independently of the database.

    What is better: storing all symbols of one data type in one table, or creating a new table for every symbol?
    STSdb 4 stores all tables in one W-Tree instance and works perfectly with parallel inserts into many tables. We have tests with hundreds of tables, where each insert is made by a separate thread. The speed is very good - sometimes better than a single insert from one thread.

    Here are some results:
    Insert 500 000 000 records with random keys:
    • in 1 table from 1 thread:
      Write speed: 69 000 rec/sec
      Read speed: 328 500 rec/sec
      Secondary read: 986 509 rec/sec
    • in 5 tables from 5 threads (x 100 000 000 records):
      Write speed: 82 000 rec/sec
      Read speed: 291 362 rec/sec
      Secondary read: 994 609 rec/sec

    The database parallelism is quite good.
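
    A minimal sketch of such parallel inserts (the symbols, record counts and file name are arbitrary): one table per symbol, each filled from its own thread, followed by a single Commit().

        string[] symbols = { "SBER", "GAZP", "LKOH" };

        using (IStorageEngine engine = STSdb.FromFile("test.stsdb4"))
        {
            //open one table per symbol up front
            var tables = symbols.ToDictionary(s => s, s => engine.OpenXTable<DateTime, TradeTick>(s));

            Parallel.ForEach(symbols, symbol =>
            {
                ITable<DateTime, TradeTick> table = tables[symbol];
                DateTime start = DateTime.Now;

                for (int i = 0; i < 100000; i++)
                    table[start.AddTicks(i)] = new TradeTick { Code = symbol, Price = i, Volume = 1, Time = start.AddTicks(i) };
            });

            engine.Commit();
        }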

    13. Suppose I need to load some time series with a time range: from date 1 to date 2, from 13h25m30s to 13h25m38s.
    Do I need to create a custom comparer? Custom lookup methods?
    Could you provide 2 examples:
    a.) search and get a data range
    b.) search and point a cursor at the beginning of a data range
    I hope you have methods without any LINQ.
    No, you don't have to create a custom comparer. Just open the table and read the data - directly by key or in a range.

    You can use either a primitive type as the key (DateTime, int, string, long etc.) or a composite key type (a class with primitive members). In both cases the automatically generated comparer code will be optimal.

    In STSdb there is no cursor concept - the tables do not keep any state when the user queries them.

    Here are some examples of using a table:
            //open table
            ITable<DateTime, Tick> table = engine.OpenXTable<DateTime, Tick>("table");
    
            //get all records in the specified key range
            DateTime from = DateTime.Now.AddHours(-10);
            DateTime to = from.AddHours(1);
    
            foreach (var kv in table.Forward(from, true, to, true))
            {
            }
    
            //get all records from key 'from' to the end (ignores 'to')
            foreach (var kv in table.Forward(from, true, to, false))
            {
            }
    
            //find first record (if any) with key >= from
            var kv = table.FindNext(from);
    
            //find first record (if any) with key > from
            var kv = table.FindAfter(from);
    
            //get the record (if any) with the smallest key 
            var kv = table.FirstRow;
    
            //get the record (if any) with the largest key
            var kv = table.LastRow;
    
    14. In reality I need 2 kinds of requests:
    - to load a data range and keep it in a temporary memory cache;
    - to iterate through a data range, loading only one tick or order book at a time, as needed for one-pass recursive metrics.
    What kind of cache tuning is better for each scenario? I suppose the best answer is according to the Windows cluster size, and more if you do not tune the Windows IO API. Is that the right line of thinking?
    As I said, you have no control over the database cache system (except the total size of the cache). The cache is shared between all tables in a database instance. If you want to distribute the available memory more precisely, you can try keeping the tables in different database instances, as sketched below.
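
    A minimal sketch (the file and table names are arbitrary): two engine instances, each with its own node cache, one per workload.

        //two separate database instances - each has its own node cache
        using (IStorageEngine tickEngine = STSdb.FromFile("ticks.stsdb4"))
        using (IStorageEngine bookEngine = STSdb.FromFile("orderbooks.stsdb4"))
        {
            ITable<DateTime, TradeTick> ticks = tickEngine.OpenXTable<DateTime, TradeTick>("ticks");
            ITable<DateTime, OrderBook> books = bookEngine.OpenXTable<DateTime, OrderBook>("books");

            //range loads use the tick engine's cache, one-pass iterations use the order book engine's cache,
            //so the two workloads do not evict each other's nodes
        }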

    15. I found in one of your threads that the "engine" sorts data by keys. Is this done automatically, or only in the table returned by a request?
    Do you have a special command to physically sort the tables stored on disk?
    Both. The WaterfallTree keeps the records almost sorted... They are sorted at the global level (by nodes), but not at the local level (within each node). When you query a table at some range or point, the engine makes small sorts of the unsorted keys in the affected nodes. The table becomes more and more sorted over time... Thus, from the user's point of view the table is completely sorted. But physically there are a lot of key areas that have never been queried by the user and remain unsorted. This W-Tree behavior is fundamentally different from B-tree indexed structures (where the data is always strictly sorted).

    16. Concerning compression: from the guide I understood that compression fires exactly after the Commit() command. Is it possible to tune the compression moment? What I mean is: should data be compressed at the moment of storing and decompressed by calling the table[key] method (better said, this call should return a new object of the DataType)? The simple, well-known logic behind this: if we store data compressed and get a DataType object on access, we gain benefits. Or do you suggest using the memory cache mode - the above scenario looks convenient there. But in my tests, with my data types, the disk read/write ratio is 7:1-8:1 (I suppose the time is mostly consumed by compression). So if I need to save table elements to disk, I have to call the in-memory base (where they are decompressed) and pass a DataType object to the file storage engine, which will try to compress it again. Is it possible to save work when passing this already-compressed data?
    As noted above, for each table the engine chooses the type of serialization for the key and record types - compression or raw mode. This, however, affects only the records in the leaf nodes. We decided to do this to avoid unnecessary compressions/decompressions - because it is assumed that the internal W-Tree nodes change more dynamically than the leaf nodes. So the engine always stores the records in the internal nodes in raw mode, and stores the records in the leaf nodes according to its decision - compression or no compression.

    The compression (and, generally, the serialization) happens when a W-Tree node has to be stored. This can occur in the Commit() method, but it may also occur when a node expires from the database cache. In both cases the compression/serialization is done just before the store process. So it is not possible to tune the compression moment.

    The database cache always keeps the W-Tree nodes unserialized, i.e. as live user objects.


    I hope these notes help.
    Thank you for your good questions!
    A.
    Last edited by a.todorov; 25.01.2016 at 14:35.

  3. #3
    Junior Member
    Join Date
    Jan 2016
    Posts
    2

    Default

    Thank you very much for the explanations. Now the main questions are clear.
    Looks like a nice 19-page addition to the developer's guide.
    A little bit delayed - I just returned to database connection tuning.
    I don't know why I missed your answer.
    It seems I'll have some more questions on IPC and server/client behaviour.
    TCP/IP: I suppose you mean sockets or pipes as the layer.
    I've read that the STSdb client/server version is faster than the single (embedded) mode version.
