
What is the overhead for each table and each key space? #364

Closed
yjiangnan opened this issue Jul 6, 2018 · 12 comments

Comments

@yjiangnan commented Jul 6, 2018

In RethinkDB, each table has an overhead of about 8 MB of memory. Is it similar in YugaByte? Does creating more tables and keyspaces add extra overhead? Or is a keyspace just a logical structure without any significant overhead?

If I have millions of users or more, and each user has millions of transaction records (time series) to keep, what would be the optimal data structure? (Let's say I want to build a performance evaluation for each user based on their transaction history.) Do I create a table for each user, or do I use a single table to keep all the records and run a query (searching the whole table?) for a user when needed? Or do I keep a list inside a JSON structure of user info and append to the list on each update? What are the trade-offs between these approaches?

@ddorian (Contributor) commented Jul 6, 2018

It depends on what types of queries you want to make (and the frequency of each). So list all the queries along with how often they will happen.

In every database it's better to keep the number of tables low (in the low thousands). Prefixing keys with user_id will enable compression and be fairly efficient.

Depending on how the JSON is stored internally (currently it is a blob), it may be more efficient to group values for a period of time together (not too big and not too small a period) to lower the overhead of each value.
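
For illustration only, a minimal sketch of such a bucketed layout (the keyspace, table, and column names here are made up; the bucket size would need tuning for your read/write pattern):

// One row per user per hourly bucket; the bucket's values are grouped into a
// single jsonb blob to lower the per-value overhead.
CREATE TABLE IF NOT EXISTS app.user_txn_buckets (
    userid       int,
    bucket_start timestamp,
    records      jsonb,
    PRIMARY KEY ((userid), bucket_start)
) WITH CLUSTERING ORDER BY (bucket_start DESC);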

@kmuthukk (Collaborator) commented Jul 7, 2018

Hi @yjiangnan

I would suggest using a single table with a compound primary key for this type of use case.

Here's a sample snippet, with some embedded explanation/notes.

CREATE KEYSPACE IF NOT EXISTS app;
USE app;

DROP TABLE IF EXISTS user_actions;

// Note: The primary key is a compound primary key based on userid & action_id.
// Note: The extra parens around userid (in CQL parlance) indicate that it is to be used as
// the partitioning key. The subsequent portions of the key are clustered/range ordered.
// This ensures that all data for a particular user falls on the same shard (tablet)
// and retrieving all of a user's actions or specific ranges (such as top-N actions or 
// from action_id <M> through <N>) is very efficient and will not incur a full-table scan.
//
// Advantages: The number of actions per user can grow very large. Millions of
// rows per userid key should not be a problem. Each row is an independent
// unit and adding a new action for a user does not incur a read-modify-write 
// overhead.  
CREATE TABLE user_actions (userid int, action_id int, payload text,
        PRIMARY KEY ((userid), action_id))
        WITH CLUSTERING ORDER BY (action_id DESC);

INSERT INTO user_actions (userid, action_id, payload) VALUES (1, 1, 'a');
INSERT INTO user_actions (userid, action_id, payload) VALUES (1, 2, 'b');
INSERT INTO user_actions (userid, action_id, payload) VALUES (1, 3, 'c');
INSERT INTO user_actions (userid, action_id, payload) VALUES (1, 4, 'd');
INSERT INTO user_actions (userid, action_id, payload) VALUES (1, 5, 'e');
INSERT INTO user_actions (userid, action_id, payload) VALUES (1, 6, 'f');
INSERT INTO user_actions (userid, action_id, payload) VALUES (1, 7, 'g');
INSERT INTO user_actions (userid, action_id, payload) VALUES (1, 8, 'h');

INSERT INTO user_actions (userid, action_id, payload) VALUES (2, 1, 'l');
INSERT INTO user_actions (userid, action_id, payload) VALUES (2, 2, 'm');
INSERT INTO user_actions (userid, action_id, payload) VALUES (2, 3, 'n');
INSERT INTO user_actions (userid, action_id, payload) VALUES (2, 4, 'o');
INSERT INTO user_actions (userid, action_id, payload) VALUES (2, 5, 'p');

Sample Query:

// Get the two most recent actions for userid 1.
SELECT   action_id, payload
FROM     user_actions
WHERE    userid = 1
ORDER BY action_id DESC
LIMIT 2;

Output:

 action_id | payload
-----------+---------
         8 |       h
         7 |       g

Sample Query:

// Get actions with action_id 4 through 7 for userid 1.
SELECT   action_id, payload
FROM     user_actions
WHERE    userid = 1 and action_id >= 4 and action_id <= 7
ORDER BY action_id DESC;

Output:

 action_id | payload
-----------+---------
         7 |       g
         6 |       f
         5 |       e
         4 |       d

@kmuthukk (Collaborator) commented Jul 7, 2018

Additionally, if need be, you can also use a table-level or row-level TTL to retire old data (the transaction history data in your case) automatically.
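
For example (just a sketch against the user_actions table above; the exact TTL values are arbitrary):

// Table-level default TTL: rows expire ~90 days after they are written.
ALTER TABLE user_actions WITH default_time_to_live = 7776000;

// Row-level TTL: this particular row expires 30 days after the insert.
INSERT INTO user_actions (userid, action_id, payload)
       VALUES (1, 9, 'i') USING TTL 2592000;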

@yjiangnan (Author) commented

@kmuthukk Wow, that is really an excellent design and crystal clear explanation! Thanks!

Since you have already specified the clustering key in PRIMARY KEY ((userid), action_id), is WITH CLUSTERING ORDER BY (action_id DESC) really necessary? Isn't it the default behavior to sort the data according to the clustering key? Additionally, does sorting in ASC order have the same performance?

@kmuthukk (Collaborator) commented Jul 7, 2018

Hi @yjiangnan

Good observation/question regarding DESC/ASC. You are correct that no matter what (i.e., whether ASC or DESC is used for CLUSTERING ORDER BY), the clustering columns in the PRIMARY KEY are stored in sorted order. The question is: is their on-disk (and in-block) order ascending or descending?

In big-O terms, the two approaches are comparable: both take O(log(N)) to get to the position of interest, and O(K) from there, where K is the number of records you want to retrieve.

However, if your queries will mostly be for recent actions (such as the top-N most recent actions, or paginating from the most recent action first), then giving the system the hint to store in DESC sort order (compared to the default ASC sort order) is marginally better in terms of CPU. This is because reverse-scanning through database blocks is slightly less efficient than forward scans.
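
For comparison, the ascending variant (which is also what you get if you omit the CLUSTERING ORDER BY clause) would simply be:

// Same schema with ascending clustering order (the default); recent-first
// queries still work, but execute as reverse scans.
CREATE TABLE user_actions_asc (userid int, action_id int, payload text,
        PRIMARY KEY ((userid), action_id))
        WITH CLUSTERING ORDER BY (action_id ASC);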

@yjiangnan (Author) commented

Hi @kmuthukk

If users want to query public histories where user ids are hidden but each record has an id or timestamp, how can I force adjacent records to be located in the same tablet for faster access? Users would query a range of times or ids (e.g., id > 55500, limit = 1000). Can I do something like PRIMARY KEY ((int(id/10000)), time)? Or do I have to create a column to store int(id/10000)?

@kmuthukk (Collaborator) commented Jul 7, 2018

If there isn't a natural user id to partition by, and you want to store records in time order and co-locate records from a nearby time range on the same tablet, then you'll have to do the bucketization (say, an hourly or daily bucket) yourself. You can do so by again creating a compound key with two parts, where the first part is the partition key (into which you would insert, say, "time/3600" for hourly bucketization), and the second part is the finer-grained time for that record.
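
For illustration, a minimal sketch of such an hourly bucketization (the table and column names, and the hourly granularity, are just assumptions):

// Partition on an hourly bucket computed by the application (e.g. unix_ts / 3600),
// and cluster on the fine-grained event time within that bucket.
CREATE TABLE IF NOT EXISTS app.public_history (
    hour_bucket bigint,
    event_time  timestamp,
    record_id   bigint,
    payload     text,
    PRIMARY KEY ((hour_bucket), event_time, record_id)
) WITH CLUSTERING ORDER BY (event_time DESC, record_id DESC);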

However, while this approach is nice/optimal for reads, because data is clustered by time, it isn't ideal from a write-scalability point of view, because during a given hour only one tablet will receive all the write load. [In other words, with replication factor 3 and a 50-node cluster, only three servers will be active on the write IO path during a given hour, and for the next hour it may be a different set of three servers.]

@yjiangnan (Author) commented

Hi @kmuthukk

Thanks for the info. I have multiple tables, so the load can still be shared by also using the table name for bucketization. However, what should the query look like? If I just specify time > timestamp, I suspect the query would still go to all the servers. So do I need to use something like time_in_hours >= "timestamp in hours" and time > "timestamp" limit = 1000? Additionally, the time is actually a TIMEUUID and the timestamp is a DOUBLE. How can I compare these two different data types in an efficient way?

@pritamdamania87 (Contributor) commented

> Depending on how the JSON is stored internally (currently it is a blob), it may be more efficient to group values for a period of time together (not too big and not too small a period) to lower the overhead of each value.

@ddorian Although JSON is stored as a blob, it's serialized as jsonb, which makes it efficient to search through the serialized blob for the appropriate JSON keys. At a high level, the JSON document is serialized with keys stored in sorted order and array elements stored in a way that allows them to be randomly accessed. As a result, searching for nested documents is efficient, since it's a combination of multiple binary searches on JSON keys or array index lookups. Updates to the jsonb document are expensive, though, since we currently need to perform a full read-modify-write of the document, although we do have some optimizations in mind to improve this further as well.
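
For example (a rough sketch; the table and document shape are made up), a jsonb column can be queried into with the -> and ->> operators in YCQL:

CREATE TABLE IF NOT EXISTS app.user_profiles (
    userid  int PRIMARY KEY,
    details jsonb
);

INSERT INTO app.user_profiles (userid, details)
       VALUES (1, '{"name": "a", "stats": {"trades": 42}}');

// '->' descends into the document and returns jsonb; '->>' returns text.
SELECT details->'stats'->>'trades' FROM app.user_profiles WHERE userid = 1;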

@kmuthukk (Collaborator) commented Jul 8, 2018

Hi @yjiangnan

With the YCQL data model, at least as it stands now, the primary key must have at least one partition column. And partition columns don't support efficient > or < queries, since the data is not stored sorted on them; the sorting applies only to the clustering columns within each value of the partition column.

So you would have to do multiple lookups across the relevant hour buckets. You would use an = predicate for the hour-bucket column of the primary key, and can additionally use range predicates (like <, >) for the ranges of interest within that hour bucket.
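
For example, against the hypothetical hourly-bucketed table sketched earlier, a query for one bucket might look like:

// '=' on the hour bucket pins the query to one partition; the range predicate on
// event_time is then served from the sorted clustering columns. A time range
// spanning several hours means issuing one such query per bucket.
SELECT event_time, record_id, payload
FROM   app.public_history
WHERE  hour_bucket = 425742
  AND  event_time > '2018-07-27 06:00:00'
LIMIT  1000;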

For YCQL, and for the SQL (Postgres) support, we do have plans to support purely range-based primary keys, where rows are globally sorted by the entire primary key (i.e. all columns can have a clustering order, and you don't need a partition column in the key).

@yjiangnan (Author) commented

Hi @kmuthukk

How do I know where a tablet is located and which replica is the leader? If I have a program doing complex processing of the data, I want to run it on the same machine as the tablet, and even have a leader program on the same machine as the tablet leader.

@kmuthukk (Collaborator) commented Jul 9, 2018

hi @yjiangnan

Regarding "how do I know where the tablet is located and who is the leader" -- I believe this is being discussed by you and @m-iancu in the other thread.

Actually, it would be great to fork this off as a separate GitHub issue/question with a summary of the discussion so far. If that sounds good, would you mind creating the issue? Thanks.

regards,
Kannan

kmuthukk closed this as completed Jul 9, 2018