Scuba’s most powerful query patterns analyze the behavior of many actors in parallel, connecting multiple events to characterize each actor’s behavior over time. Unlike a traditional data warehouse system, Scuba computes these per-actor metrics on the fly during the query, allowing an extraordinary degree of interactivity when defining and refining them. This is especially important when the log data is “raw” and likely to be contaminated in ways that are not fully understood up front.

Events can be related only if they’re in the same table

Every event is placed in one or more event tables (datasets) in your Scuba instance. Each property of the event goes in the table column corresponding to the name of the property. Typically, an event has several columns that describe the event itself, and several that allow it to be placed in the context of other events. For example, an event representing a click on a web page might have properties describing the click target and the state of the user’s interaction with the page, plus properties that identify the user account and server-side session that went into creating the web page.
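As a rough illustration of the click example above, an event can be pictured as a flat record whose property names become column names. Every field name and value below is hypothetical and is not taken from any Scuba schema; the split into "describing" and "relating" columns is just one way to read the paragraph.

```python
# Hypothetical click event: all property names and values are illustrative.
click_event = {
    "timestamp": "2024-01-15T10:32:07Z",
    "event_type": "click",
    "click_target": "checkout_button",
    "page_state": "cart_open",
    "user_id": "u-1842",
    "session_id": "s-77f3",
}

# Columns that describe the event itself (assumed grouping).
DESCRIPTIVE = {"event_type", "click_target", "page_state"}
# Columns that place the event in the context of other events (assumed grouping).
CONTEXT = {"timestamp", "user_id", "session_id"}


def split_columns(event):
    """Separate properties describing the event from those relating it to other events."""
    describing = {k: v for k, v in event.items() if k in DESCRIPTIVE}
    relating = {k: v for k, v in event.items() if k in CONTEXT}
    return describing, relating


describing, relating = split_columns(click_event)
```

Only the columns in the second group give a query something to join related events on, which is why they matter as much as the descriptive ones.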

...

Tip

Pro Tip: Focus on the layers of your system from which events can be logged with enough context to relate them to one another.

Every event must have a timestamp

That timestamp doesn’t always correspond to the exact time that the event is logged; it might be backdated to reflect an estimate of when the event “really happened,” or filled in based on data extracted from another source.

...

Tip

Pro Tip: Favor a consistent ordering of timestamps on related events over absolute accuracy, even to the extent of time-shifting some of them and correcting for this later in the analysis.
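One possible way to apply this tip, sketched under assumptions: given events already listed in their known causal order, shift any timestamp that would fall before its predecessor, and record the shift so the analysis can undo it later. The field names `ts_ms` and `ts_shift_ms` and the function `enforce_order` are hypothetical and are not part of Scuba.

```python
def enforce_order(events):
    """Return copies of events (given in known causal order) whose timestamps never go backwards.

    Each copy records how far it was shifted in 'ts_shift_ms', so the
    original timestamp can be recovered as ts_ms - ts_shift_ms.
    """
    adjusted = []
    last_ts = None
    for ev in events:
        ts = ev["ts_ms"]
        shift = 0
        if last_ts is not None and ts < last_ts:
            shift = last_ts - ts  # move forward just enough to keep the order
        new_ev = dict(ev, ts_ms=ts + shift, ts_shift_ms=shift)
        adjusted.append(new_ev)
        last_ts = new_ev["ts_ms"]
    return adjusted


# The response was stamped by a clock running 10 ms behind the request's clock.
raw = [
    {"name": "request", "ts_ms": 1000},
    {"name": "response", "ts_ms": 990},
    {"name": "render", "ts_ms": 1005},
]
ordered = enforce_order(raw)
```

Keeping the shift in its own column means the adjustment stays visible and reversible rather than silently altering the data.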

Associate events with actors

Many queries require that an event be associated with an actor. Depending on the nature of your data and the complexity of your analysis needs, you might have one or more classes of actors whose behavior you are interested in analyzing. For example, you might have vendors and purchasers, or users and advertisers, or locations and personnel.

...

Tip

Pro Tip: Organize your logging around the classes of actors that participate in the most important events, and label each event consistently with the unique IDs of the actors involved.

Plan your shard keys

An actor has a permanent identifier, called a shard key, which is a specific value of a specific shard column. If you want to analyze any relationship between multiple events, those events must have a shard key in common. A table can have multiple shard columns - at some cost in system resources - but these columns are part of the definition of the table, and changing them is a major administrative operation. Where possible, use multiple shard columns in the same table to associate events with completely different classes of actors, not with different kinds of identifiers for the same actor.
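The idea that events can be related only through a common shard key can be pictured with a small grouping sketch. This is a conceptual model in plain Python, not how Scuba stores or queries data; the column names `user_id` and `advertiser_id` and the helper `group_by_shard_key` are assumptions made for illustration.

```python
from collections import defaultdict


def group_by_shard_key(events, shard_column):
    """Group events by their value in one shard column, ordered by time within each group.

    Events with no value in that column cannot be related through it and are skipped.
    """
    groups = defaultdict(list)
    for ev in events:
        key = ev.get(shard_column)
        if key is not None:
            groups[key].append(ev)
    for evs in groups.values():
        evs.sort(key=lambda e: e["ts"])
    return dict(groups)


events = [
    {"ts": 3, "type": "purchase", "user_id": "u1", "advertiser_id": "a9"},
    {"ts": 1, "type": "ad_view", "user_id": "u1", "advertiser_id": "a9"},
    {"ts": 2, "type": "ad_view", "user_id": "u2", "advertiser_id": "a9"},
    {"ts": 4, "type": "login", "user_id": "u2"},  # no advertiser involved
]

# Two shard columns for two completely different classes of actors.
by_user = group_by_shard_key(events, "user_id")
by_advertiser = group_by_shard_key(events, "advertiser_id")
```

Here the two shard columns describe different classes of actors (users and advertisers), which is the use the paragraph recommends, rather than two kinds of ID for the same user.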

...

Tip

Pro Tip: Remember that the same real-world entity may be known by different IDs in different parts of your system, which are different “actors” for analysis purposes; plan your analysis accordingly.

Queries look at events within a time window

Every query has a time window, and only looks at the event data for events within (or, in some cases, close to) the time window. An actor might have other permanent properties, but most of the interesting properties of an actor can be extracted from events in or near the query’s time window. Event streams that involve long-lived user sessions, or other kinds of event context that isn’t always in the immediate time neighborhood, might need to be supplemented with periodic “keep-alive” logging. This can be added to the main log stream or supplied as a separate source of events to be folded into the same table.
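A minimal sketch of why keep-alive logging helps, assuming a hypothetical `plan` property that is otherwise logged only at login: a query window that starts long after the login sees the property only if a keep-alive event repeats it. The event shapes and the function `latest_in_window` are illustrative, not Scuba features.

```python
def latest_in_window(events, actor_id, prop, start, end):
    """Return the most recent value of prop for one actor among events in [start, end).

    Returns None when no event in the window carries the property.
    """
    in_window = [
        e for e in events
        if e["user_id"] == actor_id and start <= e["ts"] < end and prop in e
    ]
    if not in_window:
        return None
    return max(in_window, key=lambda e: e["ts"])[prop]


events = [
    {"ts": 100, "user_id": "u1", "type": "login", "plan": "pro"},
    {"ts": 1000, "user_id": "u1", "type": "keep_alive", "plan": "pro"},
    {"ts": 1010, "user_id": "u1", "type": "click"},
]

# The window [900, 1100) does not reach back to the login at ts=100.
with_keep_alive = latest_in_window(events, "u1", "plan", 900, 1100)
no_keep_alive = [e for e in events if e["type"] != "keep_alive"]
without_keep_alive = latest_in_window(no_keep_alive, "u1", "plan", 900, 1100)
```

Without the keep-alive event, the click in the window has nothing nearby to supply the actor's plan.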

...

Tip

Pro Tip: Give preference to categorizing actors by their recent behavior - and logging extra events as necessary to provide this context - over static auxiliary lookup tables.

Identify clusters of related events in as many ways as you can

You may also have smaller clusters of events for which you want to measure a cluster-level property: for example, the delay between delivering a message to a user’s inbox and the first time the user opened it for reading, or the number of times that the message was forwarded to other users. This kind of analysis works well when each event cluster is confined to a single shard key and is clearly separated from other clusters with the same shard key, in one of four ways:

...

Tip

Pro Tip: Identify clusters of related events in as many ways as possible - grouping in time, separator events, sequences with a known order, and per-cluster identifiers in colocated columns. Then back up your preferred analysis with cross-checks based on alternate clustering criteria.
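The message example in this section can be sketched with a per-cluster identifier. The code below assumes each event carries a hypothetical `message_id` alongside the `user_id` shard key, and measures the delay from delivery to first open for each cluster. It illustrates the idea only and is not a Scuba query.

```python
def open_delays(events):
    """Map (user_id, message_id) to the delay between delivery and the first open.

    Clusters that were delivered but never opened are left out of the result.
    """
    delivered = {}
    first_open = {}
    for e in sorted(events, key=lambda e: e["ts"]):
        key = (e["user_id"], e["message_id"])
        if e["type"] == "delivered":
            delivered.setdefault(key, e["ts"])
        elif e["type"] == "opened" and key in delivered:
            first_open.setdefault(key, e["ts"])  # keep only the first open
    return {k: first_open[k] - delivered[k] for k in first_open}


events = [
    {"ts": 10, "user_id": "u1", "message_id": "m1", "type": "delivered"},
    {"ts": 25, "user_id": "u1", "message_id": "m1", "type": "opened"},
    {"ts": 40, "user_id": "u1", "message_id": "m1", "type": "opened"},
    {"ts": 30, "user_id": "u1", "message_id": "m2", "type": "delivered"},
    {"ts": 12, "user_id": "u2", "message_id": "m3", "type": "delivered"},
    {"ts": 20, "user_id": "u2", "message_id": "m3", "type": "opened"},
]
delays = open_delays(events)
```

A cross-check in the spirit of the tip would recompute the same delays using a different clustering criterion, such as grouping in time, and compare the two results.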

Use transformations and derived columns, when necessary

Any column that is defined in a Scuba table, but not present on a specific event within the table, is considered to have a null value for that event in that column. Nulls are represented more compactly in Scuba than any other value. If a column in your data usually has a particular value (either fixed, or predictable from other properties of the same event), and only occasionally has some other value - for instance, an “affected account ID” that is almost always the same as the event’s main account ID, but is occasionally the account of another family member - it might be worthwhile to represent the predictable value with a null, and only record values that differ from the predictable value.
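The null-for-the-predictable-value idea might look like the following sketch, with a missing dictionary key standing in for a null column. The column names `account_id` and `affected_account_id` and both helper functions are hypothetical; in practice the recovery step would be expressed as a derived column or at query time rather than in Python.

```python
def encode(event):
    """Drop affected_account_id when it merely repeats the main account_id."""
    out = dict(event)
    if out.get("affected_account_id") == out["account_id"]:
        out.pop("affected_account_id", None)  # absent column is treated as null
    return out


def affected_account(event):
    """Recover the effective affected account: a null falls back to the main account."""
    value = event.get("affected_account_id")
    return event["account_id"] if value is None else value


usual = encode({"type": "charge", "account_id": "A1", "affected_account_id": "A1"})
unusual = encode({"type": "charge", "account_id": "A1", "affected_account_id": "A2"})
```

The round trip loses no information, because the dropped value can always be reconstructed from `account_id`.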

...

Tip

Pro Tip: Log events as “raw” as possible. Consider using ingest-time transformations and derived columns to eliminate redundancy, store events more compactly, and postpone complex decisions to query time.

Build helper columns

There are situations where it may be appropriate to represent one action in the system doing the logging by multiple events in Scuba. For instance, if an action is associated with two or more different actors in the same class - perhaps the sender and receiver of a message, both of which are user accounts - and you want to put it in sequence with each, you may need to log several copies of it, each with a different actor ID in the shard column. The copies may also differ in other columns, such as the role of the actor ID in the shard column (for example, “sender” vs. “receiver”).
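As a sketch of the sender and receiver case, one logged action can be fanned out into two events, one per actor, with the actor's ID in the shard column and its role in a separate column. The names `user_id`, `role`, `other_user_id`, and `fan_out` are assumptions for illustration, not part of any Scuba API.

```python
def fan_out(action):
    """Turn one message action into two events, one per participating user account."""
    base = {k: v for k, v in action.items() if k not in ("sender_id", "receiver_id")}
    return [
        dict(base, user_id=action["sender_id"], role="sender",
             other_user_id=action["receiver_id"]),
        dict(base, user_id=action["receiver_id"], role="receiver",
             other_user_id=action["sender_id"]),
    ]


action = {"ts": 500, "type": "message", "message_id": "m1",
          "sender_id": "u1", "receiver_id": "u2"}
copies = fan_out(action)
```

Each copy can then be placed in sequence with its own actor's other events, and the `role` column keeps the two copies distinguishable when counting.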

...