NABDC: Not Another Big Data Conference
Thu, June 1, 2017, 09:30 – 23:00 (France time)
NABDConf (Not Another Big Data Conference) is back for 2017! The conference by developers for developers is again bringing together the engineers who have spent their careers resolving difficult problems at scale with other engineers who are either facing similar problems or who are just plain curious as to what’s going on behind the scenes at web-scale companies like Criteo, Spotify, Uber and Google.
This year we will be discussing topics like moving PB-scale Hadoop operations from bare metal to the Google Cloud, building billion-node, billion-edge graphs, visualizing tremendous volumes of data in a browser, how one keeps track of PBs of data, raw database performance, and pretty much everything you would ever want to know about building large-scale data pipelines.
In addition to spending a riveting day with world class engineers in a small format conference, you’ll get to hobnob with them at the after party on Criteo’s world famous (I exaggerate, but only mildly) rooftop deck in the heart of South Pigalle (that’s in Paris for those who don’t know!).
The date is June 1st. The price is 50€. All proceeds go to charity. You can’t go wrong.
Kicking off NABDConf 2017
Speaker: Justin Coffey, Criteo
10:15 - 11:00
Title: Moving to the Cloud: A story from the trenches
Speaker: Josh Baer, Spotify
In 2016, the pointy-haired managers who lead Infrastructure at Spotify decided they didn’t want to be in the datacenter business. The future, they said, was big and fluffy: the future was the cloud! And it was “Googley” (https://news.spotify.com/us/2016/02/23/announcing-spotify-infrastructures-googley-future/).
At that time, Spotify had 3000+ machines dedicated to running the services that power data processing: from Kafka to Hadoop, Spark, Storm and more. In this talk, Josh Baer, a product owner at Spotify managing the data processing migration, will walk through how the migration of these services has gone so far, outlining which services have translated to the cloud easily, which have been difficult to move, and which have been replaced entirely by GCP offerings.
11:00 - 11:45
Title: Visualizing Data with deck.gl
Speaker: Nicolas Garcia Belmonte, Uber
Data is at the core of Uber's business and is fundamental for making informed decisions. The mission of the Visualization team at Uber is to deliver intelligence through the crafting of visual exploratory data analysis tools. To meet these needs, the team developed an open source visualization stack. In this talk Nicolas will give a brief overview of the Visualization team, their history, mission, and the most challenging problems they tackle. Then he'll do a deep dive into their core open source components and libraries that power most data products at Uber. He'll present their abstract and scientific data visualization stack, focusing on deck.gl, a WebGL framework for high-performance visualizations.
11:45 - 12:30
Title: HLL performance characteristics in large scale aggregations over structured data
Speaker: François Jehl, Criteo
We’ve all heard about HyperLogLog by now (if you haven’t, don’t worry, there’s an intro!) and how it can approximate billions of distinct values in data structures on the order of a few kilobytes. While quite a lot has been published on its accuracy, there is not much available on its performance. More to the point, comparing hundreds of millions of 2KB data structures, even when already located in main memory, is expensive. So we asked ourselves: “When should we aggregate HLL synopses, and when should we aggregate raw event-level data?”
In this presentation we review the performance of HLL in Vertica versus raw event level data (also in Vertica). Additionally, to get an idea of the “raw” performance of HLL we use Druid, a popular in-memory datastore with native HLL support, as a baseline.
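For readers who want the intro early, here is a minimal sketch of a HyperLogLog synopsis in Python. It is an illustration only, not the Vertica or Druid implementation: the register count (2048 one-byte registers, to match the 2KB figure above), the SHA-1 based hash, and all function names are assumptions made for this example. It shows why aggregating synopses is cheap to express but still costs one pass over every register.

```python
import hashlib
import math

P = 11          # assumption: 2**11 = 2048 registers, ~2 KB at one byte each
M = 1 << P


def _hash64(value):
    """64-bit hash of a value (SHA-1 truncated; any well-mixed hash would do)."""
    digest = hashlib.sha1(str(value).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def new_sketch():
    return bytearray(M)


def add(sketch, value):
    h = _hash64(value)
    idx = h >> (64 - P)                       # top P bits select a register
    rest = h & ((1 << (64 - P)) - 1)          # remaining 64 - P bits
    rank = (64 - P) - rest.bit_length() + 1   # position of the leftmost 1-bit
    if rank > sketch[idx]:
        sketch[idx] = rank


def merge(a, b):
    """Aggregate two synopses: the register-wise maximum."""
    return bytearray(max(x, y) for x, y in zip(a, b))


def estimate(sketch):
    alpha = 0.7213 / (1 + 1.079 / M)
    raw = alpha * M * M / sum(2.0 ** -r for r in sketch)
    zeros = sketch.count(0)
    if raw <= 2.5 * M and zeros:
        # Small-range correction: fall back to linear counting.
        return M * math.log(M / zeros)
    return raw


def sketch_of(values):
    s = new_sketch()
    for v in values:
        add(s, v)
    return s
```

Merging touches all 2048 registers of both inputs no matter how few values they hold, which is the cost the talk weighs against aggregating raw events.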
12:30 - 14:00: LUNCH
14:00 - 14:45
Title: Building a billion node / billion edge graph
Speaker: Bruno Roggeri, Criteo
At Criteo, besides our own cookie ids, we have access to billions of other identifiers:
- Mobile device ids
- Customer IDs from thousands of merchants
- Email address hashes
Each of those identifiers can be associated with one another whenever we notice joint activity.
Yup, this is a graph!
Let’s see how we started computing the groups of connected ids 2 years ago – and the problems we’re starting to face just now as we leverage more and more associations of ids.
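Computing "groups of connected ids" is the connected components problem. The sketch below uses a single-machine union-find in Python to show the idea; it is not Criteo's pipeline (at a billion nodes this would have to be distributed), and the identifier strings and the `connected_groups` name are made up for this example.

```python
from collections import defaultdict


def connected_groups(associations):
    """Group ids that are linked, directly or transitively, by associations.

    `associations` is an iterable of (id_a, id_b) pairs, one per observed
    joint activity.
    """
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        root = x
        while parent[root] != root:
            root = parent[root]
        # Path compression: point every node on the way straight at the root.
        while parent[x] != root:
            parent[x], x = root, parent[x]
        return root

    for a, b in associations:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    groups = defaultdict(set)
    for node in list(parent):
        groups[find(node)].add(node)
    return list(groups.values())


# Hypothetical identifiers, for illustration only.
example = [
    ("cookie:1", "device:A"),
    ("device:A", "email:x"),
    ("cookie:2", "merchant:9"),
]
groups = connected_groups(example)
```

Every new association can fuse two existing groups, so the groups only ever grow as more kinds of ids are linked.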
14:00 - 14:45
Title: DataDisco - One schema to rule them all and kill your data legacy
Speakers: Francois Visconte, Criteo; Pawel Szostek, Criteo; Neil Thombre, Criteo (ex-AWS RedShift)
Big Data begets Big Legacy.
Five years of Hadoop and Kafka in production and you get geological layers of data (90 PB of data and 100 TB/day), jobs (20,000 jobs running every day), and code (300k+ LOC).
At some point, the ad-hoc, “anything goes” approach stops working. We had to find a way to make the whole thing data-format agnostic, to remove hardcoded datacenter locations and paths, and to move from the obligatory stringly-typed JSON behemoth to a binary columnar format, all without converting any data, with no downtime, and with as little impact as possible on the code.
From a schema-less approach we moved to using schemas as the source of truth for both data and infrastructure configuration.
We will see how this approach allows us to:
- Describe data formats and locations as code
- Make Hadoop development format agnostic
- Create observable data flows
- Describe infrastructure as code
...and how this actually plays out on a large production system, in terms of infrastructure, code, and data.
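To make "schemas as the source of truth" concrete, here is a small, purely hypothetical Python sketch. It is not DataDisco's API: the `Dataset` class, the catalog, the path layout, and the dataset name are all invented for illustration. The point is only that job code asks the schema where the data lives and in what format, instead of hardcoding either.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Dataset:
    """Hypothetical schema entry: the one place that knows format and location."""
    name: str
    fields: dict        # column name -> type name
    data_format: str    # e.g. "json" or "parquet"
    datacenter: str
    root: str

    def path_for(self, day: date) -> str:
        return f"{self.root}/{self.datacenter}/{self.name}/day={day.isoformat()}"


# Hypothetical catalog; in this sketch, changing format or datacenter is a
# one-line change here and requires no change to job code.
CATALOG = {
    "clicks": Dataset(
        name="clicks",
        fields={"user_id": "string", "timestamp": "long"},
        data_format="parquet",
        datacenter="dc1",
        root="/data",
    ),
}


def describe_read(dataset_name, day):
    """What a format-agnostic job would ask for before reading a partition."""
    ds = CATALOG[dataset_name]
    return {
        "path": ds.path_for(day),
        "format": ds.data_format,
        "columns": list(ds.fields),
    }
```

Under these assumptions, a job calls `describe_read("clicks", date(2017, 6, 1))` and never mentions a datacenter or a file format itself.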
14:50 - 15:35
Title: Data pipeline at Spotify - from the inception to the production
Speaker: Rafal Wojdyla, Spotify
We all use the same tools and frameworks to process data, but the environment and best practices differ from one company to another. In this talk Rafal, an engineer at Spotify, will present the full journey an idea has to travel, from inception to a full-fledged data pipeline at Spotify. We will cover the tools and frameworks we use to ease the process of bootstrapping, testing, validating and productionizing a new data pipeline. You will hear about open source tools like scio, ratatool, gcs-tools and styx, as well as some internal ones. This talk will give you a sense of what it feels like to be a data engineer at Spotify, including all the struggle, and you will see that we still have a long way to go.
14:50 - 15:35
Title: Detecting Fraud with Delight
Speaker: Yann Schwartz, Criteo
15:35 - 16:00: BREAK
16:00 - 16:45
Title: BigGraphite - Graphite meets Cassandra to Scale Monitoring at Criteo
Speaker: Corentin Chary, Criteo
16:00 - 16:45
Title: Time-series workflow scheduling with Scala in Langoustine
Speakers: Guillaume Bort, Criteo; Justin Coffey, Criteo
There are many workflow schedulers available today, in both the FOSS and proprietary worlds. Langoustine is a new offering in this space. It is close in spirit to Airflow, but it is written in Scala, exposes a Scala DSL, and focuses specifically on scheduling time series data set pipelines.
Langoustine is in production, running the vast majority of Criteo data pipelines on the largest Hadoop cluster in Europe. In this talk we will discuss our hits and misses in building it, as well as its future as an open source project.
16:50 - 17:35
Speaker: Martin Groner, Google
16:50 - 17:35
Title: Hi-WAY: Execution of Scientific Workflows on Hadoop YARN
Speaker: Marc Bux, Humboldt University Berlin
18:00: ROOFTOP Cocktail!