Previous slide Next slide Toggle fullscreen Open presenter view
German strings
A case for yet another string type
Dmytro Shynkar
dmytroshynkar@gmail.com
StockholmCpp 0x38
Plan
Intro and performance promises
Implementation details
Some use cases
Name context
We love strings...
... and we have some already!
<const> char* - a C string, a NULL-terminated char array
std::basic_string<> - owning mutable string, convertible to a C string, usually has SSO and takes up 32 bytes
std::basic_string_view<> - a view into some string data, may not be null terminated
Your favorite custom implementation
We love strings...
So why would we need any more?
50-100% faster equality and comparison operations
2x smaller memory footprint for the string object
Anatomy of std::string
class string
{
size_t size;
size_t capacity;
char* data;
};
Anatomy of std::string
A bit closer to the truth
struct string
{
char * ptr;
size_t size;
union {
size_t capacity;
char buf[16 ];
};
};
struct string
{
union {
char * ptr;
char buf[16 ];
};
size_t size;
size_t capacity;
}
union string
{
struct {
size_t capacity;
size_t size;
char * ptr;
} large;
struct {
unsigned char is_large:1 ;
unsigned char size:7 ;
char buf[sizeof (large) - 1 ];
} small;
}
Anatomy of std::string
A common thing - SSO and a 32 byte size
Pictured here: libstdc++(GCC)
Anatomy of std::string
Always owning the memory
Mutable
Local for 15 to 23 byte-long strings
Null-terminated
std::string_view for non-owning strings
Just a pointer and a size, 16 bytes
What's lacking?
If we are not bound by the standard and have a specific use case in mind, we can optimize for:
Equality and ordering operations
Avoiding copies where possible
Fast small strings
std::string is usually good at the last one
What's lacking?
Is it not optimal now?
Obligatory Godbolt screenshot slide
What's lacking?
Memory latency!
Obligatory CPU cache and memory latency slide
What's lacking?
What is an appropriate type here?
std::vector<StringType> GlobalStringTable;
void addFromANetworkRequest (const NetworkRequest& request)
{
GlobalStringTable.push_back (request.string ());
}
void addFromAMmap (const MmapRegion& region)
{
GlobalStringTable.push_back (region.string ());
}
void addStatics ()
{
GlobalStringTable.push_back ("static_string" );
}
What's lacking?
std::string is large - 32 bytes
Suffers under the default calling conventions
Takes up precious cache space
What can we do differently?
Always have a prefix locally
Sometimes own the data
Immutable
12-byte SSO
4 GBs size limit
No null termination
What can we do differently?
Sometimes owning the data - string classes
Why have separate string and string_view types at all?
By tagging the pointer with 2 bits of data we can assign each string to a class
temporary - usual owning string
persistent - always valid string
transient - caller defined lifetime
Sometimes owning the data - string classes
Pros
Avoids unnecessary copies
Clear runtime values for lifetimes
Cons
A new footgun
Relies on the canonical pointer form and unused bits
Prefix
Possibly avoid the cache miss on comparison
Neatly becomes the prefix for the SSO string
We need to touch the string memory when we are creating a view to it
Show me the code!
struct german_string
{
std::uint32_t size;
union {
char small[12 ];
struct {
std::uint32_t prefix;
char * data;
} large;
};
};
Clean translation of the diagram
Does not work!
Show me the code!
struct german_string
{
std::uint64_t state[2 ];
};
Show me the actual code!
Equality
bool equals (const german_string& other) const
{
if (_state[0 ] != other._state[0 ])
{
return false ;
}
if (static_cast <std::uint32_t >(_state[0 ]) <= 12 )
{
return _state[1 ] == other._state[1 ];
}
const char * my_ptr = reinterpret_cast <const char *>(_state[1 ] & ~PTR_TAG_MASK);
const char * other_ptr = reinterpret_cast <const char *>(other._state[1 ] & ~PTR_TAG_MASK);
return std::memcmp (my_ptr, other_ptr, _get_size()) == 0 ;
}
Show me the actual code!
Compare
int compare (const german_string& other) const
{
const uint32_t min_size = std::min (_get_size(), other._get_size());
const uint32_t min_or_prefix_size = std::min (min_size, sizeof (uint32_t ));
const int prefix_cmp = prefix_memcmp (_get_prefix(), other._get_prefix(), min_or_prefix_size);
if (min_or_prefix_size == min_size || prefix_cmp != 0 )
{
return prefix_cmp != 0 ? prefix_cmp : (_get_size() - other._get_size());
}
const int result = std::memcmp (
_get_maybe_small_ptr() + 4 ,
other._get_maybe_small_ptr() + 4 ,
min_size - min_or_prefix_size
);
return result != 0 ? result : (_get_size() - other._get_size());
}
Show me the actual code!
Bit magic!
int prefix_memcmp (std::uint32_t a, std::uint32_t b, int n)
{
std::uint32_t diff = a ^ b;
std::uint32_t mask = (std::uint32_t )((0xFFFFFFFFull << (n * 8 )) >> 32 );
diff &= mask;
int first_diff = std::countr_zero (diff) / 8 ;
std::uint8_t byte_a = ((std::uint64_t )a >> (first_diff * 8 )) & 0xFF ;
std::uint8_t byte_b = ((std::uint64_t )b >> (first_diff * 8 )) & 0xFF ;
return (int )byte_a - (int )byte_b;
}
Around 2x faster than std::memcmp!
Looking at benchmarks
Clang + libc++
GCC + libstdc++
MSVC + STL
Applications - 1BRC
The plan:
Find a simple open source C++ implementation
Replace all std::string with gs::german_string
PROFIT!
Applications - 1BRC
struct Record { int cnt; float min, max, sum; }
std::unordered_map<StringType, Record> process_input (std::span<const char > data)
{
std::unordered_map<StringType, Record> db;
StringType station, value;
while (getline (data, station, ';' ) && getline (data, value, '\n' ))
{
float fp_value = stof (value);
auto it = db.find (station);
if (it == db.end ())
{
db.emplace (station, Record{1 , fp_value, fp_value, fp_value});
continue ;
}
update_record (it->second, fp_value);
}
return db;
}
Applications - 1BRC
Applying the same thing to the optimized code
We are ~10% slower...
Hashing-heavy, small-key set
A conditional on loading the data pointer hurts
Applications - Toy LSM-tree KV-store
Think HBase, LevelDB, RocksDB, etc. - but smaller and simpler
Memory-mapped storage, single level of compaction
Applications - Toy LSM-tree KV-store
template <typename StringType>
class LSMTree
{
std::map<StringType, StringType> memtable_;
std::vector<std::unique_ptr<SSTable<StringType>>> sstables_;
std::string base_dir_;
int next_sstable_id_;
};
template <typename StringType>
class SSTable
{
private :
std::string filename_;
mutable std::vector<std::pair<StringType, StringType>> data_cache_;
mutable bool cache_loaded_;
mutable std::unique_ptr<MappedFile> mapped_file_;
};
Applications - Toy LSM-tree KV-store
std::optional<StringType> LSMTree::get (const StringType &key)
{
auto result = memtable_.get (key);
if (result.has_value ())
{
return result;
}
for (auto & sstable : std::views::reverse (sstables_))
{
result = sstable->get (key);
if (result.has_value ())
{
return result;
}
}
return std::nullopt ;
}
Applications - Toy LSM-tree KV-store
A lot of string comparisons happen in both write and read paths
We can avoid copying data in addition to <=> performance boost
Applications - Real World
Most high performance OLAP systems have them!
DuckDB
Apache Arrow & DataFusion
Polars
Velox
Applications - Real World
DataFusion posts
Faster parsing
Significant improvement in E2E query performance
Thank you for your attention
Questions?
note: If you only want to know what this string type has to do with Germany you'll have to wait till the end ;)
`<const> char*` - may be heap, may be stack, may be static, requires strlen to do anything with but takes up just 8 bytes
in this talk I talking strictly about modern consumer computers which are predominantly 64bits and little endian
note: Based on totally trust-worthy benchmarks!
latest clang and libstdcxx
todo: maybe log scale for Y?
note: While all of the 3 big implementations make different tradeoffs while staying in the bound of std::string interface they share a lot in common
note: We check the size but then basically get the data pointers and call memcmp. For the case of a non-small string it might be bad.
I'm very grateful for how simple Libstdc++ implementation is as the assembly of both msvc and libc++ is uch longer and more complex due to their SSO representations
note: We will go over each part now
note: Will take up 24 bytes, because the union will want to be aligned
note: Rely on little endian
note: This is approximately 2x faster than std::memcmp for 4 byte values
branchless, may be counted as SWAR maybe?
2023 viral data aggregation challenge
note: Very fresh a definitely not biased. Aggregating a large volume of data from a CSV
note: We basically took unoptimized version that did copy strings from the file to std::string
note: Optimized version already used string_views and rarelly compared them, strings are city names and thereffore are mostly small
note: memory mapping is inadvisable
note: Because of this guy