Announcing ToroDB Stampede 1.0 beta

A few days ago we published a blog post, “The conundrum of BI/aggregate queries on MongoDB”, where we analyzed and measured some performance issues that affect aggregate/OLAP/DW-type queries on MongoDB. We also showed that if we transform the JSON data into a relational form and query it with SQL on a PostgreSQL database, performance can be up to orders of magnitude better. Impressive!

However, this requires a significant effort. The DDL needs to be defined, which may be non-trivial if the origin data is highly varied. The data also needs to be migrated, and while there are many ETL tools for that, it is still an involved process. And it won’t happen in real time! What if I require real-time statistics? What if my origin data adds a new property that is not reflected in the DDL? Do I need to sacrifice MongoDB’s “schema-less” nature to be able to perform analytic queries?

Of course not! At least, starting today. I’m very pleased to announce ToroDB Stampede, the first member of the ToroDB family and 8Kdata’s first commercial product.

With ToroDB Stampede you will see your MongoDB collections transformed, in real time, into a relational structure in PostgreSQL. From there, you can run native SQL queries on the data and use your favorite Business Intelligence tools, without requiring any ETL or any change in your current database infrastructure.

So how does it work? It’s very simple:

  • Stampede works as a secondary (hidden) node of a MongoDB replica set.
  • Once started, it performs a full initial database sync and then switches to streaming replication mode.
  • All the incoming data is transformed on the fly from document shape (strictly speaking, BSON) into a set of relational tables: tables are named after your document properties, arrays and nested documents become their own relations, and columns are named after the corresponding keys.
  • You don’t need to provide any DDL. All the DDL is automagically created by Stampede. Even if new keys or embedded documents appear, new columns and/or tables are automatically and transparently created.

And this is where the fun begins. Now you have all your MongoDB data in perfectly shaped tables in a PostgreSQL database! Visualization and data exploration are greatly improved and, more importantly, native SQL querying is at your fingertips! Use it to connect your favorite BI tools. Use it to migrate off MongoDB to PostgreSQL. Use it to have a SQL replica. Unleash your unstructured data into a relational database! See the example below to understand how ToroDB generates tables and columns out of JSON documents, and check the documentation for more information.
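
To give a rough idea of what the generated structure can look like, here is a hypothetical sketch: a GitHub-style event document with a nested actor subdocument would be spread over a root table and a child table. The table names, column names and the internal key linking them are illustrative only; the exact naming conventions Stampede uses are described in the documentation.

-- Hypothetical source document:
--   { "type": "PushEvent", "actor": { "login": "jdoe", "id": 42 } }

-- Rough sketch of the kind of relational structure that could be derived
-- ("did" is an illustrative internal document identifier, not Stampede's actual schema):
CREATE TABLE githubarchive (
    did   integer PRIMARY KEY,
    type  varchar
);

CREATE TABLE githubarchive_actor (
    did    integer REFERENCES githubarchive (did),
    login  varchar,
    id     integer
);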

[Figure: ToroDB Stampede document-to-relational mapping]

Surely enough, performance matters. Does ToroDB Stampede deliver on the promise of 10-100x faster queries? There’s only one way to find out. Benchmark time! The following benchmarks used one or more (when MongoDB was run in a sharded configuration) AWS i2.xlarge instances (4 vCPUs, 30GB RAM, 800GB local SSD). We used an XFS filesystem, and basic tuning was done on both the MongoDB and PostgreSQL configurations. For each dataset, we manually created 6 different queries that try to extract business value out of the information. MongoDB queries were written with the Aggregation Framework and Stampede ones with regular (Postgres) SQL. MongoDB 3.2 with WiredTiger (compression enabled, the default) and PostgreSQL 9.6 were used. All the tests were run 5 times, using the first two runs to warm up the caches; the numbers show the average of the last three runs.

Based on the Github Archive, we performed an initial benchmark over a 500GB dataset. We ran 6 different queries (named A through F) which you may check here: MongoDB (A, B, C, D, E, F) and Stampede/PostgreSQL (A, B, C, D, E, F).

[Figure: GitHub Archive 500GB benchmark results]

Up to 57x faster! All queries are significantly faster than on MongoDB, and only one (A) is slightly slower compared to a 3-node MongoDB cluster. Trying with a smaller dataset reveals even bigger differences. This is likely due to much better buffer management in PostgreSQL:

[Figure: GitHub Archive 100GB benchmark results]

Woah! 267x faster! Query D takes 2,400 seconds on a single MongoDB node (about 40 minutes), 383 seconds on a three-node MongoDB shard (better, but still more than 6 minutes) and just 9 seconds on a single-node PostgreSQL.

Here both MongoDB and Stampede had indexes on the _id and actor.login fields. Stampede automatically replicates any index created in MongoDB (no action required on your side). We also wanted to test whether the indexes were being used and what impact they had on performance:
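
For reference, on the relational side such an index would look roughly like the following sketch. Stampede creates it automatically when it sees the MongoDB index; the table and column names below are just illustrative, reusing the hypothetical mapping sketched earlier:

-- Illustrative only: the kind of index that would back queries on actor.login
CREATE INDEX ON githubarchive_actor (login);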

[Figure: GitHub Archive 100GB benchmark results, without indexes]

From the results we can conclude that: a) the PostgreSQL results are almost the same, which is consistent with the assumption that indexes are usually not required for aggregate queries; b) MongoDB’s results for query A worsened without indexes… but query time for query D improved significantly when the index was removed! This is probably an issue with the query planner.

We also benchmarked another dataset, based on the flight statistics information from Transtats. Similar aggregate queries were written. The data size is smaller (50GB), which leads to smaller differences:

[Figure: Transtats benchmark results]

Still, results are consistently faster, even when pitched against the three-node MongoDB sharded cluster. And up to 11x faster queries, which is a very significant improvement! While developing Stampede we have performed benchmarks where we observed more than 2000x faster queries. Of course, this may be a degraded case for MongoDB, and surely Stampede does not always perform this well under every single circumstance.

So the recommendation is always the same: please don’t trust our numbers. Do your own. Benchmark Stampede, and please let us know the results.

ToroDB Stampede is fully open source, and it’s already available on our website for download and use. Check the user documentation to learn how to install, configure and run ToroDB Stampede.

If you need more information or simply would like to give us your opinion, please feel free to comment below or join the discussion on Hacker News! Thank you.

The conundrum of BI/aggregate queries on MongoDB

MongoDB is a very popular NoSQL database. MongoDB indeed popularized the concept of the “document”: a hierarchical set of key-value pairs with a non-fixed structure (schema). Having flexible schemas and a 1:1 mapping between the business logic and the database objects are, arguably, key to MongoDB’s popularity, and the reasons why MongoDB is used as the source of truth for many OLTP applications.

Surprising to some, a dynamic schema does not remove the need for data design, nor does it prevent the performance implications of that design:

When designing data models, always consider the application usage of the data (i.e. queries, updates, and processing of the data) as well as the inherent structure of the data itself.

But what if queries cannot be known ahead of time? What if different sets of queries have almost orthogonal design needs? After all, relational theory and data normalization are aimed precisely at solving this problem: a data representation that is completely independent of the queries.

Queries that need to retrieve a document or part of it, typically via an index, perform very well. But what are the consequences of running queries on MongoDB that the data was not modeled for? Indeed, how do aggregate queries, which represent a completely different query pattern from the typical OLTP gimme-this-single-document query, perform? How does MongoDB perform when used with Business Intelligence (BI) applications, where queries are unknown and left to the end users?

Let’s reason intuitively first. What is NoSQL’s “schema-less” concept? It means that any document, with any given structure (there’s always a structure), is stored into the system as-is. One after the other. So if there is mixed information, different data belonging to different “parts” of the business logic, some documents with certain properties, others without them, what does it look like? Chaos!

Aggregate queries typically require scanning many documents, looking for specific properties of the data, often without the help of an index. Since NoSQL stores documents with any structure and data, how does the database know which documents match the query criteria? It doesn’t! It has to scan the whole dataset (normally, the whole collection) to answer the query.

This scan is not cheap. The diagram below depicts how it works on MongoDB. First, decompression may need to be performed. Then, each BSON document needs to be traversed and evaluated against the query criteria. In the worst case, every single key-value pair of every document needs to be evaluated.

Let’s use the Github Archive dataset as an example, and illustrate a query to return the top 10 most active actors. This is how MongoDB would internally process all the data in the collection to find the query results:

[Figure: MongoDB collection scan]

And the MongoDB aggregate query:

db.githubarchive.aggregate([
  {
    $group: {
      _id: '$actor.login', events: { $sum: 1 }
    }
  },
  { $sort: { events: -1 }},
  { $limit: 10 }
])

In comparison, an RDBMS with a normalized data design will typically target a subset of the tables, and only the columns involved will be evaluated. So much less I/O, which is normally the bottleneck of a database, is required to answer the same query. In this case, only the github_actor table is needed to answer the query:

[Figure: PostgreSQL table scan]

The query on a PostgreSQL relational database would be:

SELECT count(*), login
FROM github_actor
GROUP BY login
ORDER BY 1 DESC
FETCH FIRST 10 ROWS ONLY;

The I/O required by both queries can be measured using iotop -o -a:

[Figure: iotop, MongoDB vs PostgreSQL]

Using a 1.4GB sample of the Github Archive dataset, the query in MongoDB requires exactly as much I/O as the collection’s reported storage size (536.37MB). This was expected: the whole collection is scanned. PostgreSQL scans just a sixth of the data to answer the query. Taking this further, if we use Citus Data’s cstore (a columnar storage extension for PostgreSQL), the results are even more surprising: answering the query requires just 1/100th of the I/O that MongoDB requires:

[Figure: iotop, MongoDB vs PostgreSQL vs PostgreSQL with cstore]
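
Besides iotop, the PostgreSQL side can also report per-query buffer usage directly. A minimal sketch, reusing the github_actor table from the query above:

-- Shows shared buffer hits and reads (i.e. the actual I/O) performed by the aggregate query
EXPLAIN (ANALYZE, BUFFERS)
SELECT count(*), login
FROM github_actor
GROUP BY login
ORDER BY 1 DESC
FETCH FIRST 10 ROWS ONLY;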

So the performance of aggregate queries on NoSQL, more precisely MongoDB, is not as good as expected. What does this mean in practice? We measured the execution time of the above queries, comparing MongoDB and PostgreSQL on 1GB, 10GB and 100GB samples of the Github Archive dataset (measuring query execution time; less is better):

[Figure: aggregate query performance, MongoDB vs PostgreSQL]

Without zooming in, it is hard to appreciate PostgreSQL’s results:

[Figure: aggregate query performance, MongoDB vs PostgreSQL (zoomed in)]

Similar aggregate queries deliver similar results. At 8Kdata, while developing ToroDB, we have run dozens of similar queries. The results are consistently clear: a relational design almost always outperforms MongoDB aggregate queries by several factors, often orders of magnitude. Although we have seen even more pathological cases (i.e., even bigger speedups), it can safely be said that the relational approach is about “100x faster”.

So one solution for BI/aggregate queries on MongoDB is to bring the data to SQL! Indeed, Stripe did that. There are ETL tools too. However, all of these solutions require complex processes and designing the SQL schema of the destination data on your RDBMS. There’s no easy way to bring MongoDB data into a relational form.

Hope is not lost, though. In a few days, 8Kdata will be publicly announcing a solution to this problem. If you are interested, keep watching this blog, subscribe to our newsletter and follow us on Twitter. Stay tuned!

PostgreSQL page layout

As developers we use databases all the time, even daily, but do we really know how they work? How do they store information? Which internal structures do they use? This post will cover how information is stored in PostgreSQL, from the table structure down to how rows are organized on disk.

Getting page information

To get detailed information about the data structures we can use the pageinspect extension. After installing this extension we will be able to get the low-level details of PostgreSQL pages.

To install it:

# CREATE EXTENSION pageinspect;

Tables from our database

PostgreSQL has many more tables than the ones included in our schema. It stores a lot of metadata, for example the tables and columns of our schemas.

There is more than one way to list the schema’s tables, but the easiest one is the \dt command.

# \dt eightkdata.*

The previous command will return a list of the relations included in the eightkdata schema.
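
Another option is to query the system catalogs directly, for example:

# SELECT c.relname
  FROM pg_class c
  JOIN pg_namespace n ON n.oid = c.relnamespace
  WHERE n.nspname = 'eightkdata' AND c.relkind = 'r';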

Page structure in PostgreSQL

Once we know the tables in our schema, we can analyze the page structure using the functions included in the pageinspect extension. For example, we can get the header information of each page, as done in the next example for the first page of the table eightkdata.httparchive.

# SELECT * FROM page_header(get_raw_page('eightkdata.httparchive', 0));
┌────────────┬──────────┬───────┬───────┬───────┬─────────┬──────────┬─────────┬───────────┐
│    lsn     │ checksum │ flags │ lower │ upper │ special │ pagesize │ version │ prune_xid │
├────────────┼──────────┼───────┼───────┼───────┼─────────┼──────────┼─────────┼───────────┤
│ 6/AA8DD3F0 │        0 │     0 │   928 │   960 │    8192 │     8192 │       4 │         0 │
└────────────┴──────────┴───────┴───────┴───────┴─────────┴──────────┴─────────┴───────────┘

The header of a page occupies its first 24 bytes and contains basic information needed to manage future changes, like pointers to the free space or the size of the page (8KB by default in PostgreSQL).

Each column in page_header has a meaning:

  • lsn: Log Sequence Number; the address of the next byte after the last WAL record that modified this page.
  • checksum: Page checksum.
  • flags: Various flag bits.
  • lower: The offset where the free space starts; new line pointers are allocated from here (see the sketch after this list).
  • upper: The offset where the free space ends; new tuples are written downwards from here.
  • special: The offset where the special space starts; for heap pages it is at the very end of the page.
  • pagesize: The size of the page, 8KB by default, but it can be configured at compile time.
  • version: Version number of the page layout.
  • prune_xid: A hint that signals when a pruning operation may be worthwhile for this page.
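
Since lower and upper delimit the free space, the free room left in a page can be computed directly from its header. A small sketch using the same table:

# SELECT lower, upper, upper - lower AS free_bytes
  FROM page_header(get_raw_page('eightkdata.httparchive', 0));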

This data gives a nice view of the page organization, which is represented in the next diagram.

[Figure: PostgreSQL page layout]

Some of the data in the previous diagram does not appear in page_header; it is part of the PostgreSQL configuration, like the fill factor or the alignment padding.

Fill factor

The fill factor is a value that tells PostgreSQL when to stop storing tuples in the current page and switch to a new one. When pages are not completely filled, updated versions of a tuple can be stored in the same page as the original row, improving performance.
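
The fill factor is set per table through the fillfactor storage parameter. A minimal sketch (the table is hypothetical):

-- Leave roughly 30% of every page free for future updates of existing rows
CREATE TABLE eightkdata.events (
    id      bigint PRIMARY KEY,
    payload text
) WITH (fillfactor = 70);

-- It can also be changed on an existing table (only affects pages written from now on)
ALTER TABLE eightkdata.events SET (fillfactor = 70);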

Alignment padding

To increase the efficiency of I/O operations, PostgreSQL aligns data to a word length that depends on the machine it is running on. In modern computers with 64-bit processors the word length is 8 bytes.

As a consequence, tuples do not occupy exactly their nominal size on disk, because PostgreSQL adds this extra padding (alignment) to increase I/O performance.
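
The effect of alignment can be observed with pg_column_size(): the same two fields take more or less space depending on their order, because an 8-byte value must start on an 8-byte boundary.

# SELECT pg_column_size(ROW(1::smallint, 2::bigint)) AS padded,
         pg_column_size(ROW(2::bigint, 1::smallint)) AS packed;

On a typical 64-bit machine the “padded” value is several bytes larger than the “packed” one, because padding bytes are inserted between the smallint and the bigint.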

Tuple structure in PostgreSQL

Tuples can be analyzed too, using another function called heap_page_items.

# select * from heap_page_items(get_raw_page('eightkdata.httparchive', 0)) limit 10;
┌────┬────────┬──────────┬────────┬────────┬────────┬──────────┬────────┬─────────────┬────────────┬────────┬────────┬───────┐
│ lp │ lp_off │ lp_flags │ lp_len │ t_xmin │ t_xmax │ t_field3 │ t_ctid │ t_infomask2 │ t_infomask │ t_hoff │ t_bits │ t_oid │
├────┼────────┼──────────┼────────┼────────┼────────┼──────────┼────────┼─────────────┼────────────┼────────┼────────┼───────┤
│  1 │   8160 │        1 │     29 │   6202 │      0 │        9 │ (0,1)  │           2 │       2304 │     24 │ ¤      │     ¤ │
│  2 │   8128 │        1 │     29 │   6202 │      0 │        9 │ (0,2)  │           2 │       2304 │     24 │ ¤      │     ¤ │
│  3 │   8096 │        1 │     29 │   6202 │      0 │        9 │ (0,3)  │           2 │       2304 │     24 │ ¤      │     ¤ │
│  4 │   8064 │        1 │     29 │   6202 │      0 │        9 │ (0,4)  │           2 │       2304 │     24 │ ¤      │     ¤ │
│  5 │   8032 │        1 │     29 │   6202 │      0 │        9 │ (0,5)  │           2 │       2304 │     24 │ ¤      │     ¤ │
│  6 │   8000 │        1 │     29 │   6202 │      0 │        9 │ (0,6)  │           2 │       2304 │     24 │ ¤      │     ¤ │
│  7 │   7968 │        1 │     29 │   6202 │      0 │        9 │ (0,7)  │           2 │       2304 │     24 │ ¤      │     ¤ │
│  8 │   7936 │        1 │     29 │   6202 │      0 │        9 │ (0,8)  │           2 │       2304 │     24 │ ¤      │     ¤ │
│  9 │   7904 │        1 │     29 │   6202 │      0 │        9 │ (0,9)  │           2 │       2304 │     24 │ ¤      │     ¤ │
│ 10 │   7872 │        1 │     29 │   6202 │      0 │        9 │ (0,10) │           2 │       2304 │     24 │ ¤      │     ¤ │
└────┴────────┴──────────┴────────┴────────┴────────┴──────────┴────────┴─────────────┴────────────┴────────┴────────┴───────┘

For each tuple, some information is available about its position inside the page, its visibility and its size, among other things.

  • lp: The index of the tuple within the page.
  • lp_off: Offset of the tuple inside the page.
  • lp_flags: Keeps the status of the item pointer.
  • lp_len: Length of the tuple.
  • t_xmin: ID of the transaction that created the tuple.
  • t_xmax: ID of the transaction that deleted the tuple, or 0 if it has not been deleted (see the sketch after this list).
  • t_field3: It can contain one of two possible values, t_cid or t_xvac. t_cid is the command ID within the inserting or deleting transaction. t_xvac is the XID of the VACUUM operation when the row version changes.
  • t_ctid: Current TID of this or the newer row version.
  • t_infomask2: Number of attributes, plus various flag bits.
  • t_infomask: Various flag bits.
  • t_hoff: The offset where the user data starts inside the tuple.
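
For example, t_xmin and t_xmax make row visibility easy to observe: after a row is deleted, its t_xmax changes from 0 to the XID of the deleting transaction, while the tuple itself stays on the page until it is vacuumed. A small sketch against the same table:

# DELETE FROM eightkdata.httparchive WHERE ctid = '(0,1)';
# SELECT lp, t_xmin, t_xmax, t_ctid
  FROM heap_page_items(get_raw_page('eightkdata.httparchive', 0))
  WHERE lp = 1;

The second query now shows a non-zero t_xmax for that tuple.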

[Figure: tuple header layout]

TOAST (The Oversized-Attribute Storage Technique)

Looking at the page size, it is easy to see that some data cannot be stored in such a small space. For these cases there is a mechanism called TOAST.

PostgreSQL uses two thresholds, TOAST_TUPLE_THRESHOLD and TOAST_TUPLE_TARGET, both around 2KB by default. When a tuple being stored is larger than that, the fields to which TOAST can be applied (not all of them can) are compressed and/or stored in a separate TOAST table.
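
Each regular table with TOAST-able columns has an associated TOAST table, which can be located through the system catalogs:

# SELECT c.relname, t.relname AS toast_table
  FROM pg_class c
  JOIN pg_class t ON t.oid = c.reltoastrelid
  WHERE c.relname = 'httparchive';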

At 8Kdata we are looking for a unicorn

Do you enjoy hacking the Mongo protocol? And translating from BSON to an abstract key-value language and from there to a relational model? Do you take pleasure in squeezing every last bit out of the wonderful jOOQ library? Can you write more visitors per 1,000 lines of code than anyone else in the world? If you are nodding while reading this, don’t hesitate: you are the unicorn we are looking for.

Our flagship projects ToroDB and BigToro are growing at such a pace that 8Kdata’s development team cannot keep up, and we are looking for several developers to work with us at the Toroplex, our office in Alcobendas (Madrid).

Who are we looking for?

We have several openings for Java developers with previous professional experience and knowledge of the best open source database (aka PostgreSQL) who want to go as crazy as we are developing the first database built 100% in Spain. If ACID is not a music genre to you, this is your place.

In addition to the above requirements, knowledge of distributed systems, Oracle databases and Amazon Web Services will be highly valued. And, given the international orientation of the company, a minimum level of English is important.

What will you do?

Code like there’s no tomorrow. From hacking PostgreSQL’s code to understand why it won’t let you store ‘\0’ in text, or managing to launch a JVM inside (yes, INSIDE) PostgreSQL, to backporting PostgreSQL’s JSON code to Greenplum or setting up an ADS-B antenna on our rooftop terrace to log and process in ToroDB the planes within a 360 km radius.

What do we offer?

If you want to join our team, we offer:

  • Gross annual salary between 30,000 and 40,000 euros, depending on your experience.
  • Permanent employment contract.
  • Tools: a brand-new laptop (PC, that is Linux, or Mac).
  • Flexible working hours.
  • The option to work remotely some days of the week.
  • Flexibility in choosing your vacation days.
  • Freedom and responsibility to carry out your work.
  • A large, modern office in Alcobendas, with a pool, jacuzzi and a kitchen always stocked with free drinks and snacks.
  • Also, since we share the space with otogami.com, there will always be a console or an arcade machine around for a quick game.

If you think you could be one of our candidates, please send an e-mail to candela@8kdata.com attaching your CV and briefly explaining why you believe you would be a perfect fit for the company’s projects and what your availability to start would be.

FAQ

Which project will I work on?

We are currently working on several projects in parallel, all of them related to databases, which is why we have opened several hiring processes. Depending on your CV and availability, we will offer you a position on one or another.

Where would I physically work?

At the 8Kdata Group offices in Alcobendas, with a SMEG fridge always full of beer, soft drinks, juices, proper ham, cheese and Matteo’s lactose-free milk. If you want to know our exact location, click here.

Can I work remotely?

Occasionally, but we believe that being close together is necessary to achieve the communication and collaboration needed to meet the ambitious roadmap of our projects.

What will the hiring process be like?

  • First, you must send an e-mail to candela@8kdata.com with your CV, a short summary of your profile and why you think it could fit 8Kdata’s activity, as well as your availability to join the company. Please update your CV so you don’t forget to include any official degree you think may be relevant. Official degrees are not essential, but they will be highly valued in some of the open positions.
  • If your profile does not match our needs, we promise to reply to you.
  • If your profile matches what we are looking for, we will send you a technical test to get an idea of your experience and skills.
  • If you pass the test, we will have a personal interview, so we can get to know you and you can get to know us.

State of the Art of DBaaS

The folks at Tesora have published an infographic about the market for DBaaS (Database as a Service), one of the fastest growing niches in the software industry. What do you think about it? Are you using DBaaS? Should we offer ToroDB in the cloud as a DBaaS? Which is your favourite provider?

Please, don’t hesitate to share all your thoughts and ideas with us, using the blog comments or our social networks.

[Infographic: State of DBaaS, by Tesora]

8 Announcements from 8Kdata

Today is one of those days that mark the history and the future of a startup. At least, that’s how it feels to us. We have so much good news that we want to announce, make public and share with you:

#1. We have acquired Otogami!

Otogami is one of the most renowned startups in the Spanish IT sector. Led by tech gurus David Bonilla and Jerónimo López, this acqui-hire brings the much-needed productization and marketing capabilities that 8Kdata and ToroDB need. The whole Otogami staff is now part of 8Kdata’s amazing team.

[Photo: the 8Kdata team]

#2. We are still in the game!

The two product lines developed and run by Otogami (Otogami itself and Runnics) will continue operating as usual. Although they depart from 8Kdata’s products and area of expertise, we don’t have any plans to shut them down. No worries!

#3. We have a new office!

To cope with our current team, and our projected growth, we recently moved to Alcobendas (Madrid, Spain), the 3rd largest Spanish city in terms of foreign multinational corporations. Our new Toroplex is a large, cool, modern loft, with two terraces to sunbathe on while programming 🙂 Our offices are open for you to visit us any time!

#4. Better late than never. We finally have a (new) website!

Even more importantly, ToroDB now has a new home. It will be augmented soon with more information, documentation and everything you can expect from a database that will reach GA at the end of this year. There are also dedicated pages for the range of our PostgreSQL Professional Services: support, consulting and training.

#5. New Bull on the Block!

Speaking of ToroDB and the roadmap, the next version, v0.40, will be out in Q2 2016, and will bring to the table replication support for the MongoDB protocol (as a secondary node) and the Greenplum open source database as a backend, plus significant, really significant, performance improvements. Check it out on our GitHub public repo!

#6. We will be at Percona Live 2016!

If you want to know more about ToroDB, and happen to be in Silicon Valley next week, our CEO Álvaro Hernández will be speaking at the Percona Live 2016 conference about “ToroDB: Supercharging your RDBMS with MongoDB Superpowers”. It’s really cool that we received this invitation, given that we are, initially, a database based on Postgres or Postgres-derived backends!
Hint: maybe, and only maybe, there’s also something about My in that talk 😉

#7. New investors onboard!

Otogami’s previous investors have joined us. A bunch of well-known Spanish business angels and the VC fund Vitamina K are among them. Welcome, and thank you for your trust and support.

#8. One more thing…

Today we’re also announcing Big Toro. Big Toro is a ToroDB-derived product that aims to enable analytics for NoSQL. What do we mean by enabling? Most, if not all, NoSQL databases have a hard time doing analytics: they lack native SQL support and, more importantly, they really struggle with aggregate queries, due to the unstructured nature of NoSQL.

BigToro, like ToroDB, automatically creates schemas for you out of the NoSQL documents, which are replicated in real time from a MongoDB replica set (no need for ETL!). Then you can use native SQL to run your analytic queries on top of an analytics database like Greenplum or CitusDB. Big Toro GA will be announced later this year, but the current version is already 100x faster than other solutions.

SELECT what, why FROM “8Kdata”;

If anything, we are a group of passionate developers and database enthusiasts.

8Kdata is a database R&D company, aka we-love-databases-and-we-come-up-with-geeky-projects-about-them. The cool thing is that some of those geeky projects end up being serious products that will hopefully revolutionize our industry in one way or another.

We are (and started as) a PostgreSQL company. We have loved Postgres since we first learned about it, circa the year 2000. Postgres is the most advanced open source database, as the motto goes. And it deserves a companion of high-quality professional services. That’s what we strive to offer, be it consulting services, support or training.

Back to the products. After researching whether it was possible to create a billion tables in a database, we started ToroDB. ToroDB is the first database that is both NoSQL and SQL at the same time. It speaks the MongoDB wire protocol, so you can use your MongoDB drivers, tools and programs, but data is transformed into relations and stored in a PostgreSQL database, which you can query with either the MongoDB API or pure, native SQL.

Then there’s also BigToro (coming soon!), ToroDB’s offspring. BigToro wants to really enable data analytics for NoSQL. The initial advantages of unstructured data turn into really poor results on aggregate queries, the ones used in analytics. Through an automatic process, BigToro replicates from a MongoDB cluster, structures the data into relational tables, and stores it in PostgreSQL or shards it into analytics/warehousing databases (like Greenplum or CitusDB) to speed up aggregate queries, performed in SQL.

As we completely believe in open source, both ToroDB and BigToro are open source. Fully. We don’t do “open core” and then “enterprise” proprietary software. We’re fully committed to open source and everything we do is open source. Forever.

Since the introduction of NoSQL databases, users have needed to choose between either a NoSQL database or a relational one. NoSQL databases are meant to provide agility and scalability, at the cost of trading away durability, consistency and/or good query capabilities. Relational databases specialize in durability and consistency, but are harder to scale and shard.

Wouldn’t it be possible to have the benefits of both NoSQL and relational databases in the same place? This is one of the questions we repeatedly asked ourselves.

The number of NoSQL databases that have appeared in recent years is surprising. What’s even more surprising, at least to us, is that each of them both started from scratch and defined its own API/protocol. Isn’t that a lot of re-inventing the wheel? In our opinion, it is.

So we asked ourselves a second question: is it possible to create a NoSQL database without throwing away the experience and know-how acquired by relational databases over the last 3-4 decades? Is it possible to build such a NoSQL database on top of a relational database, so that it is also compatible with existing APIs/languages?

Our answer is, of course, ToroDB.

We believe there is room to create better databases. We believe that, despite the myths of relational databases being slow, RDBMSs have proven to be really fast and are software that you can really trust. We believe that in order to have NoSQL you don’t have to sacrifice transactions, strong isolation and consistency semantics and, of course, you don’t have to sacrifice SQL, the lingua franca of data querying.

We also strongly believe that databases have to be fully open source. After all the NSA and spying scandals, no organization (big or small, but especially governments) should store its data on software that may be sending data or opening backdoors to spying agencies or other third parties. The only viable alternative is open source.

(1 row)
