c++-gtk-utils
Cgu::Utf8::Reassembler Class Reference

A class for reassembling UTF-8 strings sent over pipes and sockets so they form complete valid UTF-8 characters.

#include <c++-gtk-utils/reassembler.h>

Public Member Functions

Cgu::SharedHandle< char * > operator() (const char *input, size_t size)
 
size_t get_stored () const
 
void reset ()
 
 Reassembler ()
 

Detailed Description

A class for reassembling UTF-8 strings sent over pipes and sockets so they form complete valid UTF-8 characters.

Utf8::Reassembler is a functor class which takes in a partially formed UTF-8 string and returns a nul-terminated string containing as much of the input as forms complete UTF-8 characters. Any partially formed UTF-8 character left at the end of the input passed in previous calls is inserted at the beginning before the input is examined, and any partial character at the end of the current input is stored for the next call to the functor.

If the input, after any stored partial character has been prepended, contains invalid UTF-8 (apart from a partially formed character at its end), operator() returns a null Cgu::SharedHandle<char*> object (that is, Cgu::SharedHandle<char*>::get() will return 0). Input is not treated as invalid if it consists only of a single partly formed UTF-8 character which could become valid if further bytes were received and added to it. In that case the returned SharedHandle<char*> object holds an allocated string of zero length, comprising only a terminating \0 character, rather than a null pointer.
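The carry-over behaviour described above can be illustrated with a small stand-in. This is only a sketch under stated assumptions: CarrySketch, feed(), stored() and unfinished_tail() are hypothetical names, it is not the library's implementation, it returns std::string rather than Cgu::SharedHandle<char*>, and it performs no UTF-8 validation at all. It shows only how an unfinished trailing character is held back and prepended on the next call.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical stand-in for illustration; NOT Cgu::Utf8::Reassembler.
// It holds back an unfinished trailing character between calls, but it does
// not validate the input and it returns std::string.
class CarrySketch {
public:
    // Returns the complete characters available after prepending the carry.
    std::string feed(const char* input, std::size_t size) {
        assert(input != nullptr || size == 0);
        std::string joined = carry_;
        joined.append(input, size);
        const std::size_t tail = unfinished_tail(joined);
        carry_.assign(joined, joined.size() - tail, tail);  // keep the partial character
        joined.resize(joined.size() - tail);                // return only whole characters
        return joined;
    }

    // Number of bytes held back for the next call.
    std::size_t stored() const { return carry_.size(); }

    // Discards any held-back bytes.
    void reset() { carry_.clear(); }

private:
    // Counts trailing bytes that begin a character which is not yet complete.
    static std::size_t unfinished_tail(const std::string& s) {
        const std::size_t max_back = s.size() < 4 ? s.size() : 4;
        for (std::size_t back = 1; back <= max_back; ++back) {
            const unsigned char byte =
                static_cast<unsigned char>(s[s.size() - back]);
            if ((byte & 0xC0) == 0x80) continue;  // continuation byte, keep looking
            std::size_t expected = 0;             // stays 0 for an illegal lead byte
            if      ((byte & 0x80) == 0x00) expected = 1;
            else if ((byte & 0xE0) == 0xC0) expected = 2;
            else if ((byte & 0xF0) == 0xE0) expected = 3;
            else if ((byte & 0xF8) == 0xF0) expected = 4;
            return expected > back ? back : 0;
        }
        return 0;
    }

    std::string carry_;
};
```

Feeding the four bytes "caf\xC3" yields "caf" and leaves one byte stored; feeding "\xA9!" next yields the completed two-byte character followed by "!". A chunk consisting only of a partial character yields an empty string, which mirrors the zero-length (non-null) result documented for the real class.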

This enables UTF-8 strings to be sent over pipes, sockets, etc. and displayed in a GTK+ object at the receiving end.

Note that for efficiency reasons the memory held in the returned Cgu::SharedHandle<char*> object may be greater than the length of the nul-terminated string that is contained in that memory: just let the Cgu::SharedHandle<char*> object manage the memory, and use the contents like any other nul-terminated string.

This class is not needed if std::getline(), with its default '\n' delimiter, is used to read UTF-8 characters using, say, Cgu::fdistream, because a whole '\n' delimited line of UTF-8 characters will always be complete.
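The reason line-based reading is safe is that the '\n' byte (0x0A) can never occur inside a multi-byte UTF-8 character, since every byte of such a character has its top bit set. The following sketch shows this with std::istringstream standing in for a stream such as Cgu::fdistream; read_lines(), starts_mid_character() and demo_getline() are hypothetical helpers, and the sketch assumes the sender writes valid UTF-8 in whole lines.

```cpp
#include <cassert>
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper: collects '\n'-delimited lines from a stream.
inline std::vector<std::string> read_lines(std::istream& in) {
    std::vector<std::string> lines;
    std::string line;
    while (std::getline(in, line)) lines.push_back(line);
    return lines;
}

// True if the line begins with a UTF-8 continuation byte (10xxxxxx), which
// would mean it started in the middle of a character.
inline bool starts_mid_character(const std::string& s) {
    return !s.empty() && (static_cast<unsigned char>(s[0]) & 0xC0) == 0x80;
}

// Reads two lines containing multi-byte characters and checks each is whole.
inline void demo_getline() {
    std::istringstream in("caf\xC3\xA9\n\xE2\x82\xAC 10\n");
    const std::vector<std::string> lines = read_lines(in);
    assert(lines.size() == 2);
    assert(lines[0] == "caf\xC3\xA9");       // "café", 5 bytes
    assert(lines[1] == "\xE2\x82\xAC 10");   // euro sign followed by " 10"
    for (const std::string& line : lines) assert(!starts_mid_character(line));
}
```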

This is an example of its use, reading from a pipe until it is closed by the writer and putting the received text in a GtkTextBuffer object:

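    #include <cerrno>      // errno, EINTR
    #include <cstring>     // std::strlen
    #include <iostream>    // std::cerr
    #include <unistd.h>    // ::read
    #include <gtk/gtk.h>
    #include <c++-gtk-utils/reassembler.h>

    // fd (the read end of the pipe) and text_view are assumed to be in scope.
    using namespace Cgu;
    GtkTextIter end;
    GtkTextBuffer* text_buffer = gtk_text_view_get_buffer(GTK_TEXT_VIEW(text_view));
    gtk_text_buffer_get_end_iter(text_buffer, &end);
    Utf8::Reassembler reassembler;
    const int BSIZE = 1024;
    char read_buffer[BSIZE];
    ssize_t res;
    do {
      res = ::read(fd, read_buffer, BSIZE);
      if (res > 0) {
        // Returns only complete characters; a trailing partial one is kept for the next call.
        SharedHandle<char*> utf8(reassembler(read_buffer, res));
        if (utf8.get()) {
          gtk_text_buffer_insert(text_buffer, &end,
                                 utf8.get(), std::strlen(utf8.get()));
        }
        else std::cerr << "Invalid utf8 text sent over pipe\n";
      }
      // Stop on end of file (0) or on an error other than an interrupted call.
    } while (res && (res != -1 || errno == EINTR));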

This class maintains an array as a data member, containing partly formed characters from previous calls to operator(), and should not be copied. There should be no reason to do so, but unfortunately enforcing this by explicitly precluding copy construction and copy assignment was overlooked when this class was first provided. At the next API break, the copy constructor will be explicitly deleted and moving only allowed. Where a Reassembler object is to be moved, use std::move and the code will be safe against this change in the future.
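As a sketch of what move-only usage looks like, the stand-in below models a type whose copy operations are deleted, as this class is documented to become at the next API break. MoveOnlySketch, Reader and make_reader() are hypothetical names for illustration and are not part of the library. Code that hands such an object over with std::move compiles whether or not copying is available, which is why the advice above is forward compatible.

```cpp
#include <cassert>
#include <cstddef>
#include <type_traits>
#include <utility>

// Hypothetical stand-in modelling a move-only type that holds a small array
// of pending bytes; NOT the library's Reassembler.
class MoveOnlySketch {
public:
    MoveOnlySketch() = default;
    MoveOnlySketch(const MoveOnlySketch&) = delete;
    MoveOnlySketch& operator=(const MoveOnlySketch&) = delete;
    MoveOnlySketch(MoveOnlySketch&&) = default;
    MoveOnlySketch& operator=(MoveOnlySketch&&) = default;

    void store(char byte) {
        assert(stored_ < sizeof pending_);
        pending_[stored_++] = byte;
    }
    std::size_t stored() const { return stored_; }

private:
    char pending_[4] = {};
    std::size_t stored_ = 0;
};

// A component that takes ownership of the state by value.
class Reader {
public:
    explicit Reader(MoveOnlySketch state) : state_(std::move(state)) {}
    std::size_t stored() const { return state_.stored(); }

private:
    MoveOnlySketch state_;
};

// Transfers the state with std::move rather than copying it.
inline Reader make_reader() {
    MoveOnlySketch state;
    state.store('\xC3');
    return Reader(std::move(state));
}

static_assert(!std::is_copy_constructible<MoveOnlySketch>::value,
              "copying is precluded");
static_assert(std::is_move_constructible<MoveOnlySketch>::value,
              "moving is allowed");
```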

Constructor & Destructor Documentation

Cgu::Utf8::Reassembler::Reassembler ( )
inline

The constructor will not throw.

Member Function Documentation

size_t Cgu::Utf8::Reassembler::get_stored ( ) const
inline

Gets the number of bytes of a partially formed UTF-8 character stored for the next call to operator()(). It will not throw.

Returns
The number of bytes.

Cgu::SharedHandle<char*> Cgu::Utf8::Reassembler::operator() ( const char * input, size_t size )

Takes a byte array of wholly or partly formed UTF-8 characters to be converted (after taking account of previous calls to the method) to a valid string of wholly formed characters.

Parameters
input    The input array.
size     The number of bytes in the input (not the number of UTF-8 characters).
Returns
A Cgu::SharedHandle<char*> object holding a nul-terminated string containing as much of the input as forms complete UTF-8 characters. Any partially formed UTF-8 character left at the end of the input passed in previous calls is inserted at the beginning first, and any partial character at the end of the current input is stored for the next call to the functor. If the input is invalid after such recombination, a null Cgu::SharedHandle<char*> object is returned (that is, Cgu::SharedHandle<char*>::get() will return 0). Input is not treated as invalid if it consists only of a single partly formed UTF-8 character which could become valid if further bytes were received and added to it. In that case the returned Cgu::SharedHandle<char*> object holds an allocated string of zero length, comprising only a terminating \0 character, rather than a null pointer.
Exceptions
std::bad_alloc    The method might throw std::bad_alloc if memory is exhausted and the system throws in that case. It will not throw any other exception.

void Cgu::Utf8::Reassembler::reset ( )
inline

Resets the Reassembler, by discarding any partially formed UTF-8 character from previous calls to operator()(). It will not throw.

