Whether designing database software or implementing an application using a database it is vitally important to consider who will be using the data, what they will be doing with the data, and how they will be accessing the data. These three areas are discussed here as:
- Personas — who is using the data. Also known as roles and archetypes.
- Use cases — what they are trying to do with the data.
- Access patterns — how exactly they are accessing and manipulating the data.
There can be significant overlap between various personas and use cases.
These can also be thought of as roles, archetypes, and even skill levels. Product features, documentation, training, and marketing messages may need to be framed differently to get through to distinct personas.
- The Lunatic Fringe. Super smart, willing to try anything, lack of doc and bugs and feature gaps are non-issues. A raw github and especially the source code is their preferred element. These guys excel at cobbling together disparate technologies into ad-hoc solutions. Equally they despise overly-integrated one size fits all solutions that hem them in.
- Early Adopters. Very smart, seeking true leading edge products. Their charter is to bring in fresh, new, exciting, and game-changing technology. Not as hard-core as the lunatic fringe. Access to source code may be a plus, but not an absolute requirement. At least some of them have a stronger affinity for easy to use integrated solutions, but a fair fraction of them are ready and eager to at least somewhat roll up their sleeves and work things out.
- Senior Java and C++ developers. Just as likely to be interested in the technology under the hood as the API
- Average Java application developers.
- Python, Ruby, etc. developers.
- Database administrators. Focused on data modeling.
- Ops. Care more about resource requirements, monitoring, and troubleshooting. The runbook is their bible.
- DevOps. Care equally about development and deployment.
- IT. These are the guys responsible for acquiring and provisioning hardware, basic software installation, and system-level admin. They will typically know a fair amount about traditional RDBMS databases and SQL, but NoSQL and distributed databases are a mystery to them. Outreach and education is recommended, otherwise these guys can be serious roadblocks to adequate provisioning of cluster hardware.
- End users writing SQL queries.
- Technical management. Want to understand the technology even if they are not hands-on with development or ops. Care about staffing requirements and cost as well. May be a hands-on CTO.
- Executive management. Focused on the big picture — what are the benefits to the enterprise, what are the liabilities, what are the costs. Will the new technology replace existing technology or be in addition to maintaining all of the existing technology? May be a hands-off CTO.
- Performance zealot. Demands maximum performance from a given hardware configuration.
- Cost zealot. Demands the cheapest possible hardware configuration.
- Data scientists. Heavy analytics, long-running and complex queries or bulk access to data.
- Toolkit jockeys. They hate integrated, one size fits all solutions. They love to customize. They love to tune and tinker.
- Integrated solution. They hate having to integrate, customize, and tune. They do not tinker. They are most happy if they can simply install the software and it just works.
There are four over-simplified user personas of relevance to databases:
- Contributors. These guys know their way around the source code and how to submit pull requests. Not much hand-holding, training, or do required.
- Advanced users. Not so much into the source code and pull requests, but they are very capable of filing issues and requesting sophisticated enhancements. Know how to file a decent bug report with more than enough info to guide contributors. Fairly skilled at exploiting product capabilities. Able to deduce workarounds for bugs and product limitations. If an exception or error occurs they can usually deduce both the cause and the solution.
- Basic users. Not so comfortable navigating github. Need more than a little hand-holding, training, and doc. Need advice on filing issues and guidance on what information to submit for bug reports. Need some degree of guidance on exploiting product capabilities. Need guidance on how to work around bugs and product limitations. If an exception or error occurs they usually will need support hand-holding to diagnose and resolve the issue.
- Laggard users. If something can be done in an inappropriate manner, these guys will do it. They are thoroughly resistant to quality doc, quality training, and quality support. Tell them to do something and they won’t. Tell them not to do something and they will do it anyway. They are not acting this way out of outright stupidity, but simply because they are in over their heads — they need a simple product to use.
The solution: The product needs to dynamically and automatically handle most conditions, both the expected and the unexpected, coupled with remote alerting and remote support so that more skilled external support staff can resolve any remaining issues. These are not the ideal users, unless you have the ideal automated product, but they are out there and prone to being sucked in my any excessive hype about ease of use, and there are a lot of them once you start penetrating traditional organizations, such as the less leading edge Fortune 500 and 1000 companies.
The bottom line is that they need a fully packaged solution rather than a toolkit even if it is the greatest if it has even a modest amount of “assembly required” or even most tuning and certainly it can’t require even a modest level of direct monitoring by these users. The main question is whether and when the product should cater to these users, and how to cope with them when they do show up.
Emphasis is on examples that highlight or stress the use of various features and access patterns. It would be good to pick a half dozen of these and have specific example schemas, data, and queries that illustrate the differences in the use cases.
Note: The intention here is not to list all possible applications, but to cover a wide range of data models, especially to pick application areas which have novel or illustrative data modeling challenges.
- Traditional SQL RDBMS databases.
- Sharded SQL RDBMS for high volume data.
- Social Media.
- Customer interactions (beyond explicit transactions of traditional databases).
- Time series.
- Logging. Using tables as alternative to raw text log files.
- Log search. Date range access and keyword search of log files as tables.
- Log analytics. Batch analytics of log files as tables, possibly with Hadoop.
- High frequency sensor state.
- Internet of things.
- Scientific data.
- Drug discovery.
- Drug trials.
- Medical records. Lots of semi-structured data, images, test results, as well as unstructured notes.
- Gene sequencing.
- Protein sequencing.
- Phone call records.
- Manufacturing ERP.
- Financial market data. Two-level key: security symbol and timestamp. Large volumes, high velocity.
- Financial transactions. Visa, American Express, brokerage firms.
- E-finance. Internet-based finance. PayPal, Lending Club, Square, Motif, Wealthfront.
- Semi-structured data such as documents. Commonly XML, such as SEC filings.
- Product catalog.
- Pricing. Frequent updating even if product catalog remains fairly stable.
- Human resources.
- Data aggregation vs. discrete data.
- Statistically significant data vs. high-value discrete data. For example, buying patterns and sentiment vs. individual financial transactions and personnel records.
- High frequency counters.
- High volume queues.
- Caching. For example, a sample of the massive data in a hadoop cluster.
- System of record. Is this database the true, main, master copy of the data, or is the true master source elsewhere and merely loaded into the database or indexed by the database?
- Embedded database. May be single-server or a few servers for availability. May be a commercial application designed to be deployed as a server blade on customer premises. For example, a network monitor.
- Data warehouse.
- Graph database. Ala Titan, Neo4j, Apache TinkerPop.
- Semantic Web graph storage and query. Subject, predicate, object triples. Support for SPARQL queries.
- Document repository. Office documents. Search of metadata and content.
- Object storage. Images, documents, etc. as raw blobs.
- Object metadata and indexing. For an external object store.
- Messaging storage. Email, chat, etc.
- Message metadata and indexing. For an external message store.
- High availability and resiliency for modest amounts of data. Data could fit in a traditional RDBMS, but not with the availability and resilience of a modern distributed database. May only be millions, tens of millions, or hundreds of millions of rows.
- Big Data. Data that just doesn’t fit in a traditional RDBMS without heavy manual sharding. Billions or even tens of billions of rows.
- Big Data with low latency, real-time access. Not just enough to store massive amounts of data, but instant real-time access is required.
- Big Data batch analytics. Instantaneous real-time access is not required, but reasonably rapid full-scan access is required for mission-critical analytics. Non-sequential access is also required for more sophisticated analytics such as graph detection and analysis.
- Raw key-value store. Fine for low-level storage, but most applications need significant structuring of their data.
- System of record. The authoritative source for the data in the database. The data may come from some external source, but once in the database, the data is maintained in the database, including changes. As opposed to a database which is somewhat transient and commonly and periodically reconstructed from external sources that are the true system of record.
- OLTP — online transaction processing. Robust transactions, high performance, real-time queries and updates.
- OLAP — online analytics processing. Heavy analysis of data, but primarily query load with little updating.
- Persistent in-memory. Data fits entirely in memory for minimal latency and maximal performance, but a durable transaction log and snapshots are persisted to storage.
- Distributed persistent in-memory. Local data fits entirely in memory, but operations that span ranges require fanout and aggregation, and distributed consensus.
- Transient in-memory. Supports database operations, entirely in memory, but without persistence. Optionally supports manual snapshots.
- Distributed transient in-memory. Permits sharing and coordinated transactions between nodes, but with no persistence. Optionally supports manual snapshots.
- Multitenancy. Two or more customers using the same database, possibly even sharing tables. Access control is an issue.
Emphasis is on use of database features, particularly in terms of performance. Categories are:
- Point vs. multi-row vs. bulk operations
- Schema changes
- Batch analytics
- Real-time analytics
- Legal considerations
- Bulk loading
- Load from an existing database
- Custom code to load at maximal rate
- Batch updates
- Incremental updates from normal application usage
- Occasional, infrequent updates.
- Rapid updates of the same row/column. Memtable vs. sstable issues?
- Reset a column for all rows to a specified value.
- Increment a column by a value for all rows.
- Occasional, selective deletion of column value or entire row
- Large batch deletions
- Modest to moderate batch deletions
- TTL to automatically delete column values or entire rows
- Bulk aging of data, maintaining a window of live data
- Pinpoint query for a specific row primary key.
- Slow queries. Anything that takes more than some threshold of execution time. Maybe several levels, like 250 ms, 500 ms, 1 sec, 2 sec, 5 sec, 10 sec, 30 sec, 1 min, longer.
- Narrow slice. 2–10 rows
- Medium slice. 20–100 rows
- Large slice. 250–2,000 rows
- Bulk retrieval. Maybe large number of small rows or a small number of large rows. Common for object/blob storage.
- Very large scan. 15% of a node
- Full node scan
- Node spanning scan
- Full table scan
- Scatter-gather or fanout queries for both small, medium, large, and very large number of results.
- Ad-hoc queries. Scatter-gather, but not known in advance, highly variable, on the fly
- Complex queries. More than just a few terms in the WHERE clause
- Full text queries. MS SQL Server has this built in.
- Integrated search engine. Ala DataStax Enterprise with Solr, or the Stratio and TupleJump Lucene plugins for Cassandra.
- Extract to external search engine. Batch extract or use of insert/update triggers to feed data and changes to an external search engine such as Elasticsearch or Solr.
- Geospatial search. For example, store locator. Contained in an arbitrary polygon or distance from a point.
- Date/time search. For example, store open hours or days of availability. May be a list of open times or available date ranges while query is for a point in time. Solr/Lucene uses geospatial search (two-dimensional time) to do this.
- Aggregation functions. COUNT, AVERAGE, MIN, MAX, etc.
- Simple 1-to-1 JOIN
- Simple inner JOIN
- Multi-level inner JOIN
- Simple outer JOIN.
- Multi-level outer JOIN.
- Mix of inner and outer JOIN
- Rename column. Column ID doesn’t change
- Delete column. Should existing data be removed or not?
- Add column. Optionally supply a default value to add for existing rows
- Reset column. Set column in each existing row to some specified value.
- Change column type. Optionally rewrite column in all rows, but might be better or an option to simply allow compatibility of types such as string vs. numeric. Updates would then use the correct type.
- Reorder columns. Column IDs don’t change.
- Full scan of the table with response in… many seconds to minutes?
- Narrow to medium slice scan with response within 0.5 to 5 seconds
- Access control for entire tables.
- Row-level access control.
- Column access control. For example, not all users can see salary data.
- Cell-level access control. Ala Accumulo.
- Law enforcement access.
- Data sovereignty and residency.
- Synchronous vs. asynchronous operations. The latter would be good for long-running requests.
- Server-side scripts, ala PL/SQL. Particularly long-running scripts such as bulk updates or analytics. Remote submission of scripts, carefully controlled, of course.
- Streaming and paging of results vs. all results from original request in a single response
- Raw key value store. Intense sequential scans, occasional scans, parallel/concurrent scans.