...
These are the core data types you will see in the Scuba query UI.
Data Type | Aggregations | Grouping (Split By) | Filter Operators | Value Typeahead |
---|---|---|---|---|
Identifier | Count Unique, First, Last | Yes |
matches/does not match | No |
Integer* / Decimal | Count Unique, First, Last, Min, Max, Sum, Average, Median, |
Percentile | Defaults to No, can be configured to Yes by admin | is one of |
/is not one of |
|
|
|
| No | |
Integer Set / Decimal Set | Defaults to No, can be configured to Yes by admin |
matches/does not match [column/value] | No | |
String | Count Unique, First, Last | Yes |
is one of, is not one of, text contains, starts with, ends with
matches/does not match | Yes | ||
String Set | Count Unique | Yes | string value matches/does not match |
/does not contain text | Yes | |||
Time | Count Unique, First, Last, Min, Max, Sum, Average, Median, Percentile | No | is one of, is not one of, is less than (<), is greater than (>), is less than or equal to (<=), is greater than or equal to (>=) | No |
*Integers in Scuba are stored as 64 bit signed ints and so can only be a value from -(2^63) to 2^63 - 1.
...
You won't see these as being first class data types in the Scuba query UI, but at ingest time we apply expansion rules based on these data types.
Expansion Type | Data Type of Main Column | Generated Subcolumns |
---|---|---|
IP Address | String | city, region, country, continent |
URL | N/A (not loaded into Scuba by default) | scheme, hostname, path, filename, query, params, fragment |
User Agent | N/A (not loaded into Scuba) | browser, device, platform, browser_majorver, browser_minorver, browser_patchver |
Ingest time auto-transformations
At ingest time, we automatically perform the following transformations on the raw JSON data before applying any data type recognition rules or expansions.
Transformation Rule | Original JSON | Resulting JSON |
---|---|---|
Flatten Nested JSON Objects | {"column": {"a": 1, "b": "xxx"}} | {"column.a": 1, "column.b": "xxx"} |
Shred Arrays of JSON Objects | {"column": [{"a": 1,"b": "zzz"}, {"a": 2,"b": "yyy"}]} | {"column.a": [1,2], "column.b": ["zzz","yyy"]} |
String column transformations
If you have a regular string column that includes only the letters a through h, Scuba will auto-detect it as an integer and convert it to base 10. For example, "funding_series" which has values like (a, b, c, d, e) might fall into this scenario.
Original data | Transformed into... |
---|---|
"1234" | Integer values |
"abc123" | Integer (converted to base 10), for values up to 16 hex digits |
"abcdef-124" | Identifier, for values of 17 digits or more. The value can include hyphens. |
Ingest Time Data Type Recognition Rules For JSON Number Columns
At ingest time, when we see a new column for the first time (and it is a JSON Number) we detect the data type of the new column using the following matching rules, in the precedence order listed below:
Parsing Rule | Raw Data | Data Type | Rule Details |
---|---|---|---|
Detect Time | {"abc" : 1448933490} | Time | Scuba will attempt to interpret JSON ints as epoch timestamps in one of (microseconds, milliseconds, seconds). |
Detect Integer | {"abc" : 12345} | Integer | Simple JSON ints are interpreted as Integers by Scuba. |
Detect Decimal | {"abc" : 12345.98} | Decimal | |
Detect Integer Set | {"abc" : [12345, 245, 99834]} | Integer Set | |
Detect Decimal Set | {"abc" : [12345. |
98, 245.2, 99834]} | Decimal Set |
Based on the order of precedence, if there is ambiguity about whether a column value can be interpreted as an epoch timestamp or an int, Scuba will interpret it as an epoch time value.
...
At ingest time, when we see a new column for the first time (and it is a JSON String) we detect the data type of the new column using the following matching rules, in the precedence order listed below:
Parsing Rule | Raw Data | Data Type | Rule Details |
---|---|---|---|
Detect Time From JSON String | {"abc" : "2015-11-30 08:09:12"} | Time | Scuba will attempt to interpret JSON strings as timestamps using approximately 40 different format strings (including ISO-8601). |
Detect Identifier from JSON String | {"abc" : "e41249ed-2398-4c29-a6fa-ee81116dd302"} | Identifier | Scuba will attempt to interpret JSON strings containing 16 or more characters as hexadecimal identifiers, including some common uuid formats. Note that non-hex characters (like hyphens or dots) are stripped out of the resulting data. |
Detect Integer From JSON String | {"abc" : "12345"} | Integer | |
Detect Decimal From JSON String | {"abc" : "12345.98"} | Decimal | |
Detect Decimal From JSON String With Dollar Sign | {"abc" : "$12,345.98"} | Decimal | Note that dollar signs and commas are stripped out of the resulting data. |
Detect String Set | {"abc" : ["hello", "goodbye", "nice", "to", "see", "you"]} | String Set | |
Detect URL | {"abc" : "http://www.site.com/landing/"} | URL | |
Detect IP Address | {"abc" : "127.0.0.1"} | IP Address | |
Detect User Agent | {"abc" : "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_0)"} | User Agent | Scuba will attempt to interpret JSON strings as User Agents using a regex matching scheme. |
Data Type Detection Settings For New Columns
...
Code Block |
---|
ia settings update purifier <Setting> <Value>
|
and the exact settings available are:
Setting | Value | Effect |
---|---|---|
strict_number_detection | 1 | Do not interpret JSON strings as Integer or Decimal. |
force_url_to_string | 1 | Do not interpret JSON strings as URLs. |
add_full_and_parsed_url | 1 | In addition to expanding the pieces of the URL, also store the original full URL as a String column. (This takes extra space.) |
force_geo_to_string | 1 | Do not interpret JSON strings as IP addresses. |
force_useragent_to_string | 1 | Do not interpret JSON strings as User Agents. |
force_hexn_to_string | 1 | Do not interpret JSON strings as hex identifiers. |
Note that the settings "add_full_and_parsed_url" and "force_url_to_string" are mutually exclusive.
...