The Engineering Blog

State management in Angular with observable store services

This post was originally published on jurebajt.com.

Effective state management in front-end development is a challenge, especially in larger and more complex single page applications. Right now, Redux is probably the most popular way of managing state. It is based on a few main ideas:

  • One source of truth (app state).
  • State is modified in a “pure” way via reducers.
  • Reducers are invoked by emitting events to them.
  • Interested entities are notified about state updates.

At Zemanta we have a client-facing dashboard application built as a hybrid Angular app (running AngularJS and Angular side by side). The AngularJS part of the app stores some state in components’ controllers and the rest in global services (implementing a pub-sub pattern). Every feature manages its state in a slightly different way because there are no clear conventions about state management. As a consequence, the more features we add, the harder it becomes to ensure the state stays consistent across all components and services.

The process of upgrading to Angular gave us the opportunity to rethink how we tackle state management in the app. We didn’t want to introduce another layer of complexity by adding a state management library to the codebase - the new Angular framework, TypeScript, a new build system and the hybrid app bootstrap already brought a lot of additional complexity to the mix. Instead, we used the ideas from Redux to create a state management solution that leverages Angular’s (and RxJS’s) features to do its job.

One could argue that developing a custom solution for state management introduces additional complexity to the codebase too. It would be naive to dismiss such claims. The difference, though, is in how much complexity developing features with Redux adds compared to the observable store pattern. The solution we developed is a really stripped-down version of Redux. It does not “prescribe” how to handle async actions, how to combine reducers, how to implement middleware, etc. Its only role is to provide a simple API to update the state object and to subscribe to its updates. Stores are otherwise just good ol’ Angular service classes.

This post explains how one can use the observable store pattern we developed to manage state in Angular apps. The solution was inspired by the following article from Angular University: How to build Angular apps using Observable Data Services.

To showcase the usage of observable stores we’ll build a simple app called Coffee election that lets its users vote for their favorite type of coffee and add their own coffee type to the list of candidates. The source code is available on GitHub: github.com/jurebajt/coffee-election.

Abstract Store class

At the core of observable store pattern is the abstract Store class. It leverages RxJS to achieve data flow similar to Redux. It is implemented like this:

store.ts:

import { BehaviorSubject } from 'rxjs/BehaviorSubject';
import { Observable } from 'rxjs/Observable';

export class Store<T> {
  private _state$: BehaviorSubject<T>;

  protected constructor (initialState: T) {
    this._state$ = new BehaviorSubject(initialState);
  }

  get state$ (): Observable<T> {
    return this._state$.asObservable();
  }

  get state (): T {
    return this._state$.getValue();
  }

  setState (nextState: T): void {
    this._state$.next(nextState);
  }
}

The store’s state (_state$) is an RxJS BehaviorSubject. Changing the state means pushing a new state object into the _state$ stream via the setState method. Interested entities can subscribe to state updates by subscribing to the state$ property. It is also possible to get the current state via the state property without subscribing to state updates.

Store class provides a unified interface for all features’ store services to extend. In the next section we’ll have a look at how to use the abstract Store class to implement an example feature store service.

Features’ stores

Feature specific stores are Angular Injectables extending the abstract Store class:

coffee-election.store.ts:

@Injectable()
export class CoffeeElectionStore extends Store<CoffeeElectionState> {
  ...
}

In the code snippet above note the CoffeeElectionState type used when extending the Store. Specifying CoffeeElectionState as the store type adds correct type definitions to the generic store.

CoffeeElectionState is a class representing the state object with its initial values. In the Coffee election example app it looks like this:

coffee-election-state.ts:

export class CoffeeElectionState {
  candidates: {name: string, votes: number}[] = [];
}

One last thing to do to make this simple example work is to add a super call to CoffeeElectionStore’s constructor in order to correctly initialize the state when creating a new instance of CoffeeElectionStore:

coffee-election.store.ts:

constructor () {
  super(new CoffeeElectionState());
}

With the above code in place, each instance of CoffeeElectionStore has a way of setting its state and getting the current state or an observable of the state. To make it more useful, additional methods to modify the state (similar to Redux reducers) should be added:

coffee-election.store.ts:

@Injectable()
export class CoffeeElectionStore extends Store<CoffeeElectionState> {
  constructor () {
    super(new CoffeeElectionState());
  }

  addVote (candidate: {name: string, votes: number}): void {
    this.setState({
      ...this.state,
      candidates: this.state.candidates.map(c => {
        if (c === candidate) {
          return {...c, votes: c.votes + 1};
        }
        return c;
      })
    });
  }

  addCandidate (name: string): void {
    this.setState({
      ...this.state,
      candidates: [...this.state.candidates, {name: name, votes: 0}]
    });
  }
}

In the example above CoffeeElectionStore’s functionality was extended by defining addVote and addCandidate methods. In essence, these methods modify the state by pushing new state objects into the observable state$ stream via the setState method.

Note how it is impossible to modify the state without notifying listeners about the change. This characteristic of observable stores makes them a perfect fit for implementing one-way data flow in Angular apps - much like with Redux or a similar state management library.

Using injectable store services

The app’s state could all be stored in a single global state object. But as the app grows, so does the state object, and it can quickly become too big to easily extend with new features. So instead of storing the whole state in one place, it is better to split it into smaller chunks. A good way to split the properties is to group them by feature and extract these groups into separate state objects, managed by corresponding stores.

There are two types of stores that emerge from splitting:

  • global stores that contain globally used state,
  • component stores that contain the states used by a single component.

To set up a store containing global state accessed by different services and components, the store is listed as a provider in a module’s providers list (root app module or a feature specific module). This way Angular adds a new global provider to its dependency injector. The state in global stores will be available until the page is reloaded.

app.module.ts:

@NgModule({
  ...
  providers: [ExampleGlobalStore],
})
export class AppModule {
  ...
}

Note that many global stores can be defined as providers in app’s modules, each managing its own subset of global state. The codebase stays much more maintainable this way, since each store follows the principle of single responsibility.

To use a global store in different parts of the app, the store needs to be defined as their dependency. This way Angular injects the same instance of the global store (defined as a singleton provider in AppModule or any other module) into every component and service depending on it.

example.component.ts:

@Component({ ... })
export class ExampleComponent {
  constructor (private exampleGlobalStore: ExampleGlobalStore) {
    // ExampleComponent has access to global state via exampleGlobalStore reference
  }
}

Not all state needs to be global though. Component-specific state should only exist in memory while a component is using it. Once the user navigates to a different view and the component is destroyed, its state should be cleaned up too. This can be achieved by adding the store to the component’s list of providers. This way we get “self-cleaning” stores that are kept in memory only as long as the components using them are.

example.component.ts:

@Component({
  ...
  providers: [ExampleComponentStore],
})
export class ExampleComponent {
  ...
}

Private component stores are used in the same way as global stores by defining them as dependencies in the components’ constructors. The key difference is that these component specific stores are not singletons. Instead, Angular creates a new instance of the store each time a component depending on it is created. As a consequence, multiple instances of the same component can be present in the DOM at the same time, each one of them having its own store instance with its own state.

Subscribing to state updates in components and services

Once a store instance is injected into a component or service, that component or service can subscribe to state updates. In the coffee-election component, subscribing to state updates looks like this:

coffee-election.component.ts:

@Component({ ... })
export class CoffeeElectionComponent implements OnInit {
  constructor (private store: CoffeeElectionStore) {}

  ngOnInit () {
    this.store.state$.subscribe(state => {
      // Logic to execute on state update
    });
  }
}

It is also possible to only subscribe to updates of a subset of state:

this.store.state$
  .map(state => state.candidates)
  .distinctUntilChanged()
  .subscribe(candidates => {
    // Logic to execute on state.candidates update
  });

Note that these subscriptions must be cleaned up before a component is destroyed in order to prevent memory leaks. We won’t go into details about unsubscribing in this post. Check out this topic on Stack Overflow to learn more.
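
For reference, the same subset subscription can also be written with pipeable operators if you are on a newer RxJS version (5.5+/6), and a Subject completed in ngOnDestroy is one common way to handle the clean-up mentioned above. The sketch below is our own illustration, not part of the original example - the store import path and the ngUnsubscribe$ naming are assumptions:

import { Component, OnDestroy, OnInit } from '@angular/core';
import { Subject } from 'rxjs';
import { distinctUntilChanged, map, takeUntil } from 'rxjs/operators';

import { CoffeeElectionStore } from './coffee-election.store';

@Component({
  selector: 'coffee-election',
  templateUrl: './coffee-election.component.html'
})
export class CoffeeElectionComponent implements OnInit, OnDestroy {
  private ngUnsubscribe$ = new Subject<void>();

  constructor (private store: CoffeeElectionStore) {}

  ngOnInit () {
    this.store.state$
      .pipe(
        map(state => state.candidates),
        distinctUntilChanged(),
        takeUntil(this.ngUnsubscribe$)
      )
      .subscribe(candidates => {
        // Logic to execute on state.candidates update
      });
  }

  ngOnDestroy () {
    // Completing the subject tears down all takeUntil-guarded subscriptions
    this.ngUnsubscribe$.next();
    this.ngUnsubscribe$.complete();
  }
}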

Subscribing to state updates in components’ templates

In case a component doesn’t execute any logic on state update and it only serves as a proxy to pass the state to its template, Angular provides a nice shortcut to subscribe to state updates directly from templates via the async pipe. ngFor in the example below will redraw a list of candidates every time the state is updated.

coffee-election.component.html:

<ul>
  <li *ngFor="let candidate of (store.state$ | async).candidates">
    <span>{{ candidate.name }}</span>
    <span>Votes: {{ candidate.votes }}</span>
    <button (click)="store.addVote(candidate)">+</button>
  </li>
</ul>

These subscriptions to state updates via async pipes are automatically cleaned up by the framework upon destroying the component.

Unit testing the store

Testing state modifying store methods is pretty straightforward. It consists of three steps:

  1. Creating an instance of the tested store and setting up mocked initial state.
  2. Calling a store’s method the test is testing.
  3. Asserting the method updated the state correctly.

In practice, the unit tests for the store from the Coffee election example look like this:

coffee-election.store.spec.ts:

describe('CoffeeElectionStore', () => {
  let store: CoffeeElectionStore;

  const MOCK_CANDIDATES = [{name: 'Test candidate 1', votes: 0}, {name: 'Test candidate 2', votes: 5}];

  beforeEach(() => {
    TestBed.configureTestingModule({
      providers: [CoffeeElectionStore]
    });

    store = TestBed.get(CoffeeElectionStore);
    store.setState({
      candidates: MOCK_CANDIDATES
    });
  });

  it('should correctly add a vote to a candidate', () => {
    store.addVote(MOCK_CANDIDATES[1]);
    expect(store.state.candidates[0].votes).toBe(0);
    expect(store.state.candidates[1].votes).toBe(6);
  });

  it('should correctly add a candidate', () => {
    store.addCandidate('Test candidate 3');
    expect(store.state.candidates[2].name).toBe('Test candidate 3');
  });
});

Conclusion

The purpose of this post was to present how one can leverage the built in features of Angular framework to implement a simple yet powerful state management solution. The provided Coffee election example app is very simple, but the concepts it demonstrates can be used to successfully manage state in much bigger and more complex apps. At Zemanta we used observable store services to implement a rather complex feature and since the experiment worked out great we will continue to use such stores in our app going forward.

Zemanta's Student Programming Challenge 2017

The University of Ljubljana’s Faculty of Computer and Information Science and Zemanta are co-hosting a programming competition, with a chance to win great prizes - Google Pixel 2, Google Home Max and the opportunity for a paid internship at Zemanta.

This year’s assignments are:

  • Alternative Ad Title Generator
  • Real-time Business Data Analytics

Alternative Ad Title Generator

Everyone likes interesting and clickable ad titles: they are more intriguing to web users, help get advertisers’ message across and generate more revenue for publishers. The goal of this assignment is to create an Alternative Ad Title Generator which takes an existing title and comes up with meaningful alternatives that maximize ‘clickability’.

We are looking for a teachable system which takes an ad title as input and outputs 5 automatically generated alternatives. Outputs have to be meaningful and should preferably retain the context of the original title.

You’re required to prepare your own ad title learning datasets by scraping the web or by using a public database. We’ll provide a dataset of words and their average click-through rates (CTRs) which you can use to help train your generators. We’ll also include a sample of titles, some of which will be used for assessing the results.

Your solution must be open source and contain the code for both learning and predicting. Additionally, provide the dataset used for teaching the model and a short report, clearly describing both setup instructions and your approach to solving the problem.

Your program will be tested with a subset of titles from the provided sample. We’ll submit the generated alternative titles to our Zemanta One platform, exposing your work to some of the largest publishers on the web like cnn.com, bbc.com or foxnews.com. In addition, our marketing experts will look at your generated titles and score them based on clickability and meaningfulness.

Real-time Business Data Analytics

Every time we’re browsing the web, advertisers are competing for our attention. They bid against each other for ad space offered by websites via ad exchanges. This process is called real-time bidding. Each pageview announces the start of an auction - bidders are notified that bidding is about to happen with a bid request, they respond with suitable ads and a price, the winning ad is selected by the exchange and shown to the viewer, and the winning bidder is notified of his win. After that, the viewer might or might not click on the ad, and he might or might not buy the offered item. All of these events are valuable business metrics and are therefore stored in logs. Now imagine millions of users browsing the web at the same time.

The goal of this assignment is to build a robust and scalable service that enables its users to run ad-hoc queries against such huge but well-defined amounts of data close to real-time.

You’re required to choose the best storage solution, build a pipeline that consumes and stores the data, and provide a simple programmable interface for querying that will be able to return results in just a few seconds on a consumer laptop. You can use any suitable open-source storage solution in your service.

We’ll provide a script that simulates different services by generating random logs and some example queries.

Your solution has to be open-source and it has to include setup and query instructions. Dockerized solutions are a big plus. Prepare a short report where you describe the architecture of the system and include some benchmarks. We’ll test your service by continuously inserting thousands of logs per second, evaluate the results of a set of non-disclosed queries and measure the response times.

Application

Apply to the challenge online by sending an email to studentski-izziv@zemanta.com or by filling out a form on Učilnica FRI 17/18. The application deadline is 11/6/2017!

Assignments will be presented on 10/25 at 13:15 at Garaža, FRI.

Assignment submission

Assignments have to be submitted to Učilnica FRI 17/18 by 12/11/2017. The award ceremony will take place on 12/21/17 at Zemanta.

For more information about the competition and the assignments visit Učilnica FRI 17/18 or send an email to studentski-izziv@zemanta.com.

Detecting Suspicious Websites

It’s every advertiser’s worst nightmare: advertising on a seemingly legitimate site only to realize that the traffic and/or clicks from that site are not the coveted genuine human interest in the ad. Instead they find fake clicks, unintentional traffic or just plain bots. In a never-ending quest for more ad revenue, some website publishers scramble for ways to increase traffic. And not all approaches are as respectable as improving the site’s content, its readability and SEO.

Some publishers try to boost their stats by buying cheap traffic. Unfortunately, such traffic is often of bad quality. Many sellers that purport to drive good traffic to a site at an extremely small cost actually use botnets to create that traffic.

Botnets can often be detected by looking at which sites share a common set of visitors. Of course, almost all websites share some of their visitors, but this percentage is small. Moreover, as the site accumulates more visitors, the probability of a large overlap occurring by chance becomes infinitesimal. To make things harder a botnet can, among the suspicious sites, visit several well-known and respected websites, so that the apparent credibility of the malicious sites is artificially boosted.

The key is that these botnets re-use the same machines for different “customers”, which we may be able to leverage. The question is thus, can we identify websites infested with such traffic? And if so, then how? The answer to the first question is yes, and to the second is this blog post.

Covisitation graphs

Our problem lends itself nicely to a network approach called a covisitation graph. We will construct a graph such that sites which share traffic are tightly connected, especially if visitors are shared between several sites, as is usually the case. We can represent all of the websites as vertices (nodes) in a graph. To encode the overlap between a pair of sites, we will use directed edges weighted between 0 and 1. This number is the proportion of visitors from the source site of the edge that were also seen on the target site. An example is shown in Fig. 1. Let A have 200 visitors, B have 1000 visitors, and the shared traffic be 100 visitors. The edge from A to B is the overlap divided by A’s visitors, 100/200 = 0.5. Similarly, from B to A we have 100/1000 = 0.1. Of course, a network is no network if it’s completely connected, so we must set an appropriate threshold that removes edges but keeps the important ones. A simple threshold of 0.5 proved most useful in our case.

Fig. 1. Covisitation edge calculation example.
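
To make the calculation concrete, here is a minimal sketch in Go (the language we use on the backend) of how covisitation edge weights could be computed from per-site visitor sets and filtered by the threshold. The data structures and names are illustrative only, not our production implementation:

package main

import "fmt"

// Edge holds the directed covisitation weight from Src to Dst: the share of
// Src's visitors that were also seen on Dst.
type Edge struct {
	Src, Dst string
	Weight   float64
}

// covisitationEdges builds weighted edges from per-site visitor sets and keeps
// only those at or above the given threshold (0.5 in our case).
func covisitationEdges(visitors map[string]map[string]struct{}, threshold float64) []Edge {
	var edges []Edge
	for src, srcVisitors := range visitors {
		if len(srcVisitors) == 0 {
			continue
		}
		for dst, dstVisitors := range visitors {
			if src == dst {
				continue
			}
			shared := 0
			for v := range srcVisitors {
				if _, ok := dstVisitors[v]; ok {
					shared++
				}
			}
			if w := float64(shared) / float64(len(srcVisitors)); w >= threshold {
				edges = append(edges, Edge{Src: src, Dst: dst, Weight: w})
			}
		}
	}
	return edges
}

func main() {
	// The numbers from Fig. 1: A has 200 visitors, B has 1000, 100 are shared.
	// A -> B gets weight 100/200 = 0.5 (kept), B -> A gets 100/1000 = 0.1 (dropped).
	visitors := map[string]map[string]struct{}{"A": {}, "B": {}}
	for i := 0; i < 1100; i++ {
		id := fmt.Sprintf("user-%d", i)
		if i < 200 {
			visitors["A"][id] = struct{}{} // A: users 0-199
		}
		if i >= 100 {
			visitors["B"][id] = struct{}{} // B: users 100-1099 (100-199 shared with A)
		}
	}
	fmt.Println(covisitationEdges(visitors, 0.5)) // prints [{A B 0.5}]
}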

Clustering coefficient

To get the full network, we must repeat this calculation for all pairs of sites. The network is useful for more than just computation, as a quick visual inspection reveals many interesting things. Tight clusters of websites that are only loosely connected to the core of the graph are precisely the aberrations we would expect when several sites band together. In graph theory, such fully connected groups are called cliques, and they have been well researched. A well-known measure of “cliqueness” is the clustering coefficient, shown in Fig. 2.

Fig. 2. A visual representation of the clustering coefficient. Image courtesy of Santillan et al.

It is a number between 0 and 1, with 0 meaning that none of the node’s neighbors are connected with each other, and 1 meaning the opposite (i.e. the node and its neighbors form a clique). We can combine this number with the probability of such a cluster occurring randomly to get a sensible measure of suspicion for each site. Thus, a small clique of 2-3 sites is somewhat likely to occur, but a clique of 20 nodes warrants a closer discerning look.
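
As a reference point, here is a minimal sketch of the local clustering coefficient computation, treating the graph as undirected and stored as adjacency sets for simplicity; it illustrates the definition above and is not our production code:

// localClusteringCoefficient returns, for a given node, the fraction of pairs
// of its neighbours that are themselves connected: 0 means no neighbour is
// connected to another, 1 means the node and its neighbours form a clique.
func localClusteringCoefficient(adj map[string]map[string]struct{}, node string) float64 {
	neighbours := make([]string, 0, len(adj[node]))
	for n := range adj[node] {
		neighbours = append(neighbours, n)
	}
	k := len(neighbours)
	if k < 2 {
		return 0
	}
	links := 0
	for i := 0; i < k; i++ {
		for j := i + 1; j < k; j++ {
			if _, ok := adj[neighbours[i]][neighbours[j]]; ok {
				links++
			}
		}
	}
	// k*(k-1)/2 is the number of possible links between k neighbours.
	return float64(links) / float64(k*(k-1)/2)
}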

Fig. 3. Covisitation graph of all websites with at least 100 unique visitors over a 3-day span. Node size is logarithmically proportional to the number of visitors.

Results

The circled areas in Fig. 3 indicate exactly these occurrences: several groups of websites with dissimilar content that all share each other’s traffic. One of the suspicious cliques is completely detached from the rest of the graph.

On one hand, there are going to be a lot of naturally occurring edges in the graph. Overlaps as high as 50% are not uncommon between websites with similar content and audience. On the other hand, sites should have very few outgoing edges, especially if they are popular and have many visitors. Popular sites may have many incoming edges, but as long as the sources of these edges are not connected with each other, it is not suspicious.

Thus, out of the vast interconnected graph that is the world wide web, we can, with precision, pick out the few bad apples that are trying to rake in revenue unfairly and to the detriment of everyone else. We mark such sites as suspicious. This list is then used to avoid buying impressions from these sites and to analyse the sources of this traffic. The whole approach outlined in this blog post is one of the tools in Zemanta’s arsenal for combating invalid traffic, where the focus lies in bringing quality traffic for the best possible price.

The analysis was part of my 3-month internship at Zemanta, where I worked on detecting this type of suspicious traffic at scale. The work I’ve done was a breeze with such awesome support from the whole team. It was a great learning experience and an invaluable opportunity to work on a codebase that scales to so many requests per second. Personally, I believe that this bridging of the gap between research and industry benefits both parties immensely, and I hope to do more such collaborations in the future.

Further reading

The original covisitation approach was first used in:

Stitelman, Ori, et al. “Using co-visitation networks for detecting large scale online display advertising exchange fraud.” Proceedings of the 19th ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, 2013,

where our more inquisitive readers can find additional information about this approach.

How I Learned Go

Introduction and Motivation

Earlier this year I joined Zemanta, which, as a Native Demand Side Platform, operates in the real-time bidding business. We bid in auctions for native ad slots while users are consuming content. When we win an auction, our client’s ad is shown to the user in a specific slot on a web page. You can imagine we process a lot of impressions per second. Zemanta chose Go to build the whole infrastructure around real-time bidding instead of high-speed enterprise languages such as Java or C++/C#. Go performs well compared to Java, and its GC pauses are shorter on average. But technical specifications were not the main reason Zemanta went with Go; its fast and simple development process was.

Go is a relatively new programming language. Like many new tech inventions, Go was created as an experiment. The goal of its creators was to come up with a language that would resolve the bad practices of other languages while keeping their good parts. Go 1.0 was released in March 2012, and since then Go has attracted many developers from all fields and disciplines.

This short introduction to Go should serve as motivation for further reading. The rest of this blog post focuses on the steps I took to learn Go and on some interesting patterns Go uses to efficiently solve specific problems.

Getting Started

For me, learning Go started with Go’s official webpage. There I found installation files, tutorials, documentation, etc. I had no clue where to start, so I asked a couple of experienced Go coders, and they all suggested I start with A Tour of Go. This interactive tutorial is great for getting familiar with Go’s core ideas and syntax. I didn’t even have to install anything on my computer, since all exercises can be done in an interactive web “interpreter”. I finished the tutorial in a couple of days and got an idea of how things like variable initialization, pointers, structs, lists, etc. are done in Go.

I really couldn’t say I understood Go very well at that point. Still, together with my experience from other programming languages, it gave me enough knowledge to do my first task at Zemanta: refactoring a Hadoop MapReduce job. A (streaming) MapReduce job consists of two programs (a mapper and a reducer), each accepting lines on stdin and printing lines to stdout. You can pass the mapper and the reducer as binaries to the Hadoop infrastructure and it will run the MapReduce job. Go is perfect for this since it compiles into a single binary, and you can compile it for any platform from any platform. I work on a Mac and Hadoop usually runs on Linux, so I just had to type

GOOS=linux go build

in my terminal and upload the files to our Hadoop cluster. Cross platform build? With all the dependencies and infrastructure to run the program in one binary file? “I could get used to that”, was my first thought after checking other Go tools that come out of the box when you install Go.

No More Exceptions

By this point I had become familiar with the syntax, as well as with another important part of writing big IT systems: error handling. Go’s simple approach to error handling looked too good to be true at first. In Go, functions usually return a value and an error (functions can return multiple values), so handling errors is as simple as checking the second return value. This means errors get passed around until they are finally logged, printed or trigger some other action, which gives a good way to chain errors from one place to another. Sounds nice, but it sometimes just doesn’t look clean when 80% of your code checks for errors. On the other hand, it forces developers to handle errors, which can prevent some critical runtime errors.

func mapper(w io.Writer, r io.Reader) {
	in := bufio.NewReader(r)

	for {
		line, err := in.ReadBytes('\n')

		if err == io.EOF {
			break
		} else if err != nil {
			log.Error(err)
			break
		}
		if len(line) == 0 {
			continue
		}

		// do something with the line

		if _, err := w.Write(line); err != nil {
			log.Error(err)
		}
	}
}

After checking all of the official resources, I started searching for other posts about Go programming patterns. I came across Go by Example, which is really similar to A Tour of Go but offers a bit more line-by-line explanation and is organized really well. Every programming language has many ways of expressing something; similarly, there are many tutorials explaining the same thing in different ways. Go by Example helped me a lot with real-life examples and how to solve them. Another similar resource I came across at the same time, which is actually part of the official documentation, was Effective Go. Both helped me make my code look more “Goish”.

Implicit is Great

Go by Example explained some patterns that seemed strange to me at the time, one of them being the use of interfaces I was seeing everywhere. Why was everyone so excited about them? For example, why do we need a Reader interface with only a Read method signature?

type Reader interface {
    Read(p []byte) (n int, err error)
}

A waste of code, I thought: any struct that wants to read something will obviously implement a Read method anyway. Then it finally clicked.

Let’s see why this is so neat with a practical (maybe a bit long) example.

func printContent(r io.Reader) error {
	contentBytes, err := ioutil.ReadAll(r)
	if err != nil {
		return err
	}
	fmt.Println(string(contentBytes))
	return nil
}

func main() {
	f1, err := os.Open("./test")
	if err != nil {
		// do error handling
	}
	defer f1.Close()

	response, err := http.Get("http://www.zemanta.com/")
	if err != nil {
		// do error handling
	}
	defer response.Body.Close()

	f2, err := os.Open("./testZip")
	if err != nil {
		// do error handling
	}
	defer f2.Close()

	z, err := gzip.NewReader(f2)
	if err != nil {
		// do error handling
	}
	defer z.Close()

	printContent(f1)
	printContent(response.Body)
	printContent(z)
}

When you understand interfaces in Go, you understand interfaces in general. I don’t think interfaces are hard to understand; it’s just that in Go they are used to do exactly what they were designed to do in any programming language - to logically group code that has similar behavior (writing less code is just a consequence). In the example above, we can pass anything that implements the Read method (i.e. satisfies io.Reader) to printContent(r io.Reader) instead of writing a handler for each specific case. We also don’t need to explicitly declare that those types implement the Reader interface. This is great for us humans, but not for our IDE, which has to figure out on its own which interfaces a type satisfies.
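
To see this implicit satisfaction in action, here is a small made-up example (reusing printContent from above and assuming the io and strings packages are imported): a type gains a Read method and can immediately be passed to printContent, without declaring anywhere that it implements io.Reader.

// upperReader wraps another reader and upper-cases everything read through it.
// Nothing declares that it implements io.Reader - having a Read method with
// the right signature is enough.
type upperReader struct {
	src io.Reader
}

func (u upperReader) Read(p []byte) (int, error) {
	n, err := u.src.Read(p)
	for i := 0; i < n; i++ {
		if p[i] >= 'a' && p[i] <= 'z' {
			p[i] -= 'a' - 'A' // convert ASCII lower case to upper case
		}
	}
	return n, err
}

func main() {
	// Both *strings.Reader and upperReader satisfy io.Reader, so printContent
	// accepts either one without any extra declarations.
	r := upperReader{src: strings.NewReader("implicit interfaces are neat")}
	if err := printContent(r); err != nil {
		// do error handling
	}
	// Output: IMPLICIT INTERFACES ARE NEAT
}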

Two-value Assignment

Another cool thing about interfaces is that every type in Go implements at least zero methods, which means every type satisfies the empty interface, interface{}. This is useful when we do not know the exact type of a variable.

var data interface{}
if err := json.Unmarshal(raw_bytes, &data); err != nil {
	log.Error(err)
}

How do we get a concrete type back out of this? With a type assertion. A single-value type assertion is a simple statement, but if it fails, the program panics. The safer way is to assign a second value in the type assertion, which holds a boolean telling you whether the assertion succeeded; on failure, the first value is just the zero value of the asserted type. If a valid value can be the same as the type’s zero value, the two-value assignment is the only way to tell the difference.

jsonStruct := data.(map[string]interface{}) // single-value form: panics if data is not a map
num := jsonStruct["num"].(float64)          // JSON numbers decode to float64; panics on failure
str, ok := jsonStruct["str"].(string)       // two-value form never panics
if !ok {
	log.Error("Couldn't convert to string")
}

The same two-value pattern applies when retrieving values from maps.

m := make(map[string]int)
m["Answer"] = 42
v, ok := m["Answer"]
if !ok {
	log.Info("Could not get value for key Answer")
}

The above example also hints at how we use sets in Go. Since there is no dedicated set type, we make use of maps.

idSet := make(map[string]struct{})
idSet["id1"] = struct{}{}
idSet["id2"] = struct{}{}

if _, ok := idSet["id1"]; ok {
	log.Println("id1 is in idSet")
} 

if _, ok := idSet["id3"]; !ok {
	log.Println("id3 is not in idSet")
}

The Great Go Book

This brings me to the last resource I used on my way to Go fluency - The Little Go Book. It is written for readers who are already comfortable with Go syntax, and it gave me another perspective on how to write idiomatic Go. It was exactly what I needed to let the knowledge I had accumulated over the past few weeks sink in, and it reassured me I was on the right path. While reading the book I had a lot of “Ahaaaa” moments, usually around the topics described above, and after reading it, the real fun began.

Conclusion

The whole process of going through tutorials, examples, posts and books took me around 5 weeks. I had programming experience before I started learning Go and had already worked with pointers and interfaces in other programming languages. During those 5 weeks I finally understood a lot of general programming concepts I had heard of before but never quite grasped, and it’s all thanks to Go. I encourage you to learn Go not because it is really cool and hyped right now, but because you will learn a lot about programming and software development in general. Just think about the fact that Go was built on 40 years of programming experience and best practices. Technology has changed a lot in those years, and so should the interfaces we use to manipulate it.

First steps with Amazon Redshift

At Zemanta we have been using Amazon Redshift as our database of choice for campaign traffic statistics. We strive to deliver query results for our analytics view to our users close to real-time. This view is one of the core screens that our users use to manage their campaigns and see how well they perform, so it needs to be in shape.

Amazon Redshift is a great piece of technology, but it does not magically produce good results. In fact, performance between an ad-hoc design and an optimized one can differ greatly. Table design and the queries you want to execute need to be tailored to how its internals work, and nourished along the way as new features get developed and data grows in size. Only then can we get satisfactory results.

In this article we try to sum up the basics of efficient Amazon Redshift table design, and provide some resources that got us going in the right direction.

The basics

Amazon Redshift is a managed data warehousing solution whose main promise is fast analytical queries on big data through extensive use of parallelism. It can load and unload big chunks of data directly from or to Amazon S3 (or other AWS storage services), with which it integrates seamlessly. This is a big plus if your infrastructure resides in the AWS cloud, as ours does. It can effectively do large data transformations and so can be used for your ELT (extract, load, transform) pipeline. There is a great article from Panoply that compares ELT with ETL - the more common way of transforming data before loading it.

A common first impression is that Redshift is actually some kind of PostgreSQL database, because you can communicate with it using tools made for PostgreSQL, e.g. psql. It did emerge from PostgreSQL, but as we will see, it is really far from it and should be treated differently.

Cluster

When you start using Redshift, you provision a cluster. A cluster consists of a leader node and multiple compute nodes. These are further subdivided into slices.

The leader receives queries, optimizes them, creates query plans and coordinates the work amongst its nodes. A node provides CPU, memory and disk for query execution and data storage. Memory and storage are proportionally allocated to its slices. A slice executes a single query segment within one process at a time.

Amazon Redshift cluster architecture.

More on cluster architecture and the query execution workflow can be found in the official documentation here and here.

Data storage

The way data is stored and organized within a cluster directly influences how tables and queries should be designed for them to be performant.

The four most important concepts of data organization are: data distribution, sort keys, zone maps and compression. Here we will cover them briefly and provide some additional resources where you can research them in more depth.

Distribution

Distribution determines how data is distributed among slices. With it we define the strategy by which records are grouped together. This lets us put data where it needs to be beforehand and reduces the need for the query optimizer to move data between nodes in order to query it efficiently. This comes in handy when we join data.

In the following example of data distribution we see traffic and conversion records grouped by campaign. When we join such data by campaign, no records need to be redistributed between nodes, as they can be joined within a single slice. If the distribution were different and we wanted to make the same join, records would need to be relocated temporarily before the query could execute.

Example: Data collocation.
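
To make this concrete, here is a sketch of how two such tables could be distributed by campaign; the table definitions are simplified illustrations, not our production schema:

-- Collocate traffic and conversion records of the same campaign on the same
-- slice, so a join on campaign_id needs no data redistribution.
CREATE TABLE traffic (
    date date not null,
    campaign_id int2 not null,
    clicks integer
) diststyle key distkey(campaign_id);

CREATE TABLE conversions (
    date date not null,
    campaign_id int2 not null,
    conversion_count integer
) diststyle key distkey(campaign_id);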

With a proper distribution strategy we also try to load nodes evenly, so that parallel processes have a similar amount of work to do. The end processing time is determined by the slowest process, so a node with a disproportionate amount of work will directly influence the overall query execution time.

The consequence of uneven data grouping is skew (skewed table data). In such cases some slices contain more data than others, which can slow down our queries and cause storage on some nodes to fill up prematurely.

Example: Skew.

Sort keys and zone maps

Redshift stores data in columns. This means that blocks of values from the same column are stored on disk sequentially as opposed to row based systems where values from a row follow sequentially.

Sequential values of a column are partitioned within 1MB blocks that represent the smallest data organizational unit. Redshift will load an entire block to fetch a value from it, so it is important for us to understand how it decides to select a block. This way we can minimize I/O operations.

A sort key sets the order of records within blocks and tables. It behaves like ORDER BY in queries.

Zone maps are metadata about blocks: they hold the MIN and MAX values within each block. Using sort key order and zone map information, Redshift can skip blocks that are irrelevant for a particular query.

In the following example we illustrate the difference in the amount of I/O required to locate the offset of relevant records. We defined a table that holds clicks grouped by date and campaign id, and we ordered the data by date. To query clicks for the date 2017/02/12, Redshift is able to find the relevant records by loading and reading only the first block of the date column. On the other hand, to get clicks for campaign id 500, two blocks of the campaign id column have to be loaded and read to get the record offsets for the other columns.

CREATE TABLE stats (
    date date not null,
    campaign_id int2 encode lzo,
    clicks integer encode lzo
) sortkey(date)

Example: Sort keys and zone maps.

When adding data to our tables we need to maintain the sort order and the metadata. We can either add data in the correct order, or we can use VACUUM to re-sort it and ANALYZE to update the metadata and table statistics.
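
For the stats table defined above, that maintenance boils down to two commands, typically run after larger loads:

-- Re-sort rows according to the table's sort key and reclaim space left by
-- deleted rows, then refresh the table statistics used by the query planner.
VACUUM stats;
ANALYZE stats;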

Compression

Due to Redshift’s columnar nature, it can efficiently compress the data it holds. This means that the amount of data within a 1MB block can be increased and the overall storage footprint reduced, which also saves some I/O because less data needs to be read from disk. And because storage is columnar, each column’s records can shrink or grow independently of the other columns.

In the sort keys and zone maps example we can see different storage requirements for the date and campaign id columns. The same number of values took 3 blocks for dates but only 2 blocks for campaign ids, because the campaign id column compressed better.

By default, no compression is applied to the data. A good first step is the ANALYZE COMPRESSION command: we execute it on already loaded data and it gives us recommended encodings based on a sample. If we have some knowledge about our data, we can also select encodings ourselves. The available algorithms and when to use them are listed here.
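
Running it against the stats table from the earlier example looks like this; it returns a recommended encoding per column, based on a sample of the loaded data:

-- Suggest a compression encoding for each column, based on a data sample.
ANALYZE COMPRESSION stats;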

Code compilation

Not related to the way data is stored, but still an important concept in Amazon Redshift, is code compilation.

When Redshift receives a query, it optimizes it and generates a query plan. The execution engine takes that plan, generates compiled C++ code from it and puts the compiled code into an LRU cache. The compilation step can take a while, which is why the first run of a query is always slower, sometimes by an order of magnitude, than the following runs of the same query. Later runs of the same (or a similar) query take the compiled code from the cache and skip the compilation step.

A query won’t need to be recompiled if only some filter parameters change. On the other hand, Redshift will recompile a query if everything else is the same but different or additional columns are selected. Unfortunately, not much is documented about which changes trigger recompilation. In this article the author experiments a bit and finds some patterns.

Cases where we need to be aware of this behaviour are benchmarks and dynamic query building. When we do benchmarks, we always discard the first query execution time, as it will always be an outlier. When we do dynamic query building, we try to reduce the number of unique queries. One way we do that is to always query all columns that are available for a given view and strip out the ones we won’t show later, e.g. based on permissions. This way a query for the same view but for two users with different permissions will be the same, or similar enough that it won’t need to be recompiled.

Conclusion

We have been using Amazon Redshift for more than a year now and we are pretty comfortable with it. We think the concepts described in this article will give a beginner a good start with the fundamentals of optimal table design.

We realize there is a lot more to it, so in the last chapter of this article, Additional resources, we selected a few more resources that helped us, and we are sure will help you, go further and deeper.

Additional resources